Kang-chen / data_pipeline


GSE152058 #13

Open Mimianmy opened 1 week ago

Mimianmy commented 1 week ago

Source ID: GSE152058

I have updated the code in dataextract.py to check for the existence of features.tsv.gz (or genes.tsv.gz), barcodes.tsv.gz, and matrix.mtx.gz files. If any of these files are missing, a FileNotFoundError is raised, and the program exits.

[screenshot]

Bug details: The error now encountered is an IndexError: list index out of range. It occurred when the code tried to access a file starting with the expected prefix 'GSM4601691_190807_zQ175_snRNA-B7' but found no matching file. The IndexError was caught and re-raised as a FileNotFoundError with the message: "No matching files found for prefix 'GSM4601691_190807_zQ175_snRNA-B7'. Exiting."

(Optional) Steps to reproduce:

1. Run the run_data_pipeline.sh script on the dataset GSE152058.
2. Wait for the script to reach the MTX processing step.
3. Observe the error during file merging in the Python script dataextract.py, where the prefix 'GSM4601691_190807_zQ175_snRNA-B7' does not match any files.

Kang-chen commented 5 days ago

@Mimianmy It's a good idea to check for the existence of the three required files. I have two suggestions that might help:

  1. Instead of raising a FileNotFoundError, consider not enforcing the presence of all files. We don't need every dataset to be imported perfectly; it's more important to handle what can be imported smoothly and allow certain data to be skipped when needed. The structure and naming of files in this dataset, for example, are quite confusing: the _coldata.tsv.gz file has barcodes starting from the second row of the first column, and the _rowdata.tsv.gz file lists ENSEMBL IDs in the same pattern. Since this is not a common layout, it may be better to either read these files manually or simply skip the dataset if it's not crucial.

  2. You might want to broaden the criteria for matching files. For instance, if the files do not end with .gz, the code could drop the .gz expectation to accommodate them. Some datasets even pack all files into an additional tar archive, which is worth handling as well.
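For the manual-read option in suggestion 1, a minimal sketch is below. The file names and the helper name `read_first_column` are hypothetical; only the layout (values starting from the second row of the first column) is taken from the dataset description above.

```python
import gzip

def read_first_column(path, skip_header=True):
    """Collect the first tab-separated column of a gzipped TSV,
    skipping the header row (the _coldata/_rowdata layout described above)."""
    values = []
    with gzip.open(path, 'rt') as fh:
        if skip_header:
            next(fh, None)  # drop the header line
        for line in fh:
            values.append(line.rstrip('\n').split('\t')[0])
    return values

# Hypothetical usage (paths are placeholders):
# barcodes = read_first_column('GSMxxxxxxx_coldata.tsv.gz')    # cell barcodes
# ensembl_ids = read_first_column('GSMxxxxxxx_rowdata.tsv.gz') # ENSEMBL gene IDs
```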

By the way, please create a pull request (PR) for your changes to the dev branch. 
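Suggestion 2 could be sketched as follows. The helper names (`normalize_name`, `find_component`, `unpack_inner_tars`) are illustrative, not part of dataextract.py:

```python
import os
import tarfile

def normalize_name(filename):
    """Strip a trailing .gz so 'features.tsv' and 'features.tsv.gz' both match."""
    return filename[:-3] if filename.endswith('.gz') else filename

def find_component(files, suffixes):
    """Return files whose normalized name ends with any accepted suffix."""
    return [f for f in files
            if any(normalize_name(f).endswith(s) for s in suffixes)]

def unpack_inner_tars(output_dir):
    """Handle datasets that wrap everything in an extra tar archive."""
    for f in os.listdir(output_dir):
        path = os.path.join(output_dir, f)
        if os.path.isfile(path) and tarfile.is_tarfile(path):
            with tarfile.open(path) as tar:
                tar.extractall(output_dir)
```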

gefujing commented 5 days ago

Hi Anying, Kang (@Kang-chen @Mimianmy), I checked this dataset and found the following problems:

There are many samples that have only .mtx data and no features or barcodes file. After getting the prefix (from the .mtx file name), it is not possible to create a folder and move the features and barcodes files, because they do not exist.

Your code does provide a more detailed error message, but for this task that is not the most reasonable way to handle it, because the dataset also contains complete samples, such as:

First:
GSM6805775_210407_HDg1_snRNA-A7_barcodes.tsv.gz
GSM6805775_210407_HDg1_snRNA-A7_features.tsv.gz
GSM6805775_210407_HDg1_snRNA-A7_matrix.mtx.gz

Second:
GSM6805776_210407_HDg1_snRNA-A8_barcodes.tsv.gz
GSM6805776_210407_HDg1_snRNA-A8_features.tsv.gz
GSM6805776_210407_HDg1_snRNA-A8_matrix.mtx.gz

There are two better ways to fix it:

  1. When extracting prefixes, do not take them from the .mtx files alone; keep only the prefixes that exist in all three file types (matrix, features, and barcodes). You only need to modify this line of code:
prefixes = set([f.rsplit('_', 1)[0] for f in matrix_files])
  2. Build lists that match features and barcodes files under more naming conventions. You only need to modify these lines of code:
features_files = [f for f in files if 'features.tsv.gz' in f or 'genes.tsv.gz' in f]
barcodes_files = [f for f in files if 'barcodes.tsv.gz' in f or 'identities.tsv.gz' in f]
matrix_files = [f for f in files if 'mtx.gz' in f]
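The second fix could look like the sketch below, with the accepted name fragments kept in one place so new datasets only require editing the lists. The extra fragments (rowdata.tsv, coldata.tsv) are examples drawn from this dataset's naming, not an exhaustive survey of GEO conventions:

```python
# Accepted name fragments for each 10x component; extend these lists as
# new naming schemes appear (the entries beyond features/genes and
# barcodes/identities are illustrative additions).
FEATURE_NAMES = ['features.tsv', 'genes.tsv', 'rowdata.tsv']
BARCODE_NAMES = ['barcodes.tsv', 'identities.tsv', 'coldata.tsv']
MATRIX_NAMES = ['matrix.mtx', '.mtx']

def match_any(filename, fragments):
    name = filename.lower()
    return any(frag in name for frag in fragments)

def split_10x_files(files):
    features = [f for f in files if match_any(f, FEATURE_NAMES)]
    barcodes = [f for f in files if match_any(f, BARCODE_NAMES)]
    matrices = [f for f in files if match_any(f, MATRIX_NAMES)]
    return features, barcodes, matrices
```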

You can refer to my code in other parts, for example:

keywords = ['count', 'raw', 'data', 'umi', 'smart', 'express']
exclude_keywords = ['meta', 'process', 'normalized', 'annotations', 'celltype', 'adt', 'cells', 'tissue']
txt_gz_files = [f for f in files if any(keyword in f.lower() for keyword in keywords) 
             and not any(exclude in f.lower() for exclude in exclude_keywords) 
             and (f.endswith('.txt.gz') or f.endswith('.csv.gz') or f.endswith('.tsv.gz')
                 or f.endswith('.txt') or f.endswith('.csv') or f.endswith('.tsv'))]
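A quick sanity check of the keyword filter above against a few made-up file names:

```python
# Same keyword lists as in the snippet above; the sample file names are
# invented for illustration.
keywords = ['count', 'raw', 'data', 'umi', 'smart', 'express']
exclude_keywords = ['meta', 'process', 'normalized', 'annotations',
                    'celltype', 'adt', 'cells', 'tissue']

def keep(f):
    name = f.lower()
    return (any(k in name for k in keywords)
            and not any(x in name for x in exclude_keywords)
            and name.endswith(('.txt.gz', '.csv.gz', '.tsv.gz',
                               '.txt', '.csv', '.tsv')))

sample = ['GSM1_raw_counts.txt.gz', 'GSM1_metadata.csv',
          'GSM1_umi.tsv', 'GSM1_normalized_data.csv.gz']
print([f for f in sample if keep(f)])
# → ['GSM1_raw_counts.txt.gz', 'GSM1_umi.tsv']
```

The metadata and normalized files are dropped by the exclude list even though they contain the keywords 'data'.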
Mimianmy commented 3 days ago

@gefujing Actually, based on the latest script, these files can be detected as:

[screenshot]

However, it seems that the error still occurs even though the relevant prefixes have been detected. Would it be useful if I added more keywords for detecting these files? Thank you!

[screenshot]

gefujing commented 3 days ago

@Mimianmy You can develop and debug the following code:


# required imports for this snippet
import os
import shutil
import scanpy as sc
import anndata as ad

# get prefixes
def extract_prefix(file_list):
    return set([f.rsplit('_', 1)[0] for f in file_list])

# read and merge 10x files
def read_and_merge_mtx_files(output_dir):

    files_and_dirs = os.listdir(output_dir)
    files = [f for f in files_and_dirs if os.path.isfile(os.path.join(output_dir, f))]
    dirs = [d for d in files_and_dirs if os.path.isdir(os.path.join(output_dir, d))]

    features_files = [f for f in files if 'features.tsv.gz' in f or 'genes.tsv.gz' in f]
    barcodes_files = [f for f in files if 'barcodes.tsv.gz' in f or 'identities.tsv.gz' in f]
    matrix_files = [f for f in files if 'mtx.gz' in f]

    if not (features_files and barcodes_files and matrix_files and not dirs):
        if dirs:
            print("Detected folders, processing them with read_10x_mtx...")
        else:
            print("Required feature, barcode, and matrix files not found.")
            return None

    # get prefixes
    features_prefixes = extract_prefix(features_files)
    barcodes_prefixes = extract_prefix(barcodes_files)
    matrix_prefixes = extract_prefix(matrix_files)

    prefixes = features_prefixes & barcodes_prefixes & matrix_prefixes

    adata_list = []

    for prefix in prefixes:
        feature_file = [f for f in features_files if f.startswith(prefix)][0]
        barcode_file = [f for f in barcodes_files if f.startswith(prefix)][0]
        matrix_file = [f for f in matrix_files if f.startswith(prefix)][0]

        # create folders
        folder_name = os.path.join(output_dir, prefix.strip('_'))
        if not os.path.exists(folder_name):
            os.makedirs(folder_name)

        # move files and rename
        shutil.move(os.path.join(output_dir, feature_file), os.path.join(folder_name, 'features.tsv.gz'))
        shutil.move(os.path.join(output_dir, barcode_file), os.path.join(folder_name, 'barcodes.tsv.gz'))
        shutil.move(os.path.join(output_dir, matrix_file), os.path.join(folder_name, 'matrix.mtx.gz'))

        # check features
        check_and_remove_header(os.path.join(folder_name, 'features.tsv.gz'), file_type='features')
        check_and_remove_header(os.path.join(folder_name, 'barcodes.tsv.gz'), file_type='barcodes')
        check_and_update_features(os.path.join(folder_name, 'features.tsv.gz'))

        # read 10x files
        print(f"Reading {folder_name}...")
        adata = sc.read_10x_mtx(folder_name, var_names='gene_symbols')
        adata = filter_single_adata_by_gene_count(adata)
        if adata is None:
            continue
        adata = convert_symbol_to_ensembl(adata)
        adata_list.append(adata)

    # directly read folders
    for folder in dirs:
        folder_path = os.path.join(output_dir, folder)
        print(f"Directly reading from folder {folder_path}...")
        adata = sc.read_10x_mtx(folder_path, var_names='gene_symbols')
        adata = filter_single_adata_by_gene_count(adata)
        if adata is None:
            continue
        adata = convert_symbol_to_ensembl(adata)
        adata_list.append(adata)

    # merge adata
    adata_list = filter_adata_by_species(adata_list)
    adata_list = filter_adata_by_gene_count(adata_list)

    # guard against the case where no sample survived filtering
    if not adata_list:
        print("No valid AnnData objects were produced; nothing to merge.")
        return None

    if len(adata_list) > 1:
        print("Merging AnnData objects...")
        adata_merged = ad.concat(adata_list, label="batch")
    else:
        adata_merged = adata_list[0]

    return adata_merged

This code adds a helper that extracts the prefixes and takes their intersection to find the common ones, which should solve the problem.
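The intersection step can be exercised in isolation with made-up sample names, to show how incomplete samples (an .mtx file without matching features and barcodes) drop out:

```python
# Same prefix-extraction logic as in read_and_merge_mtx_files above;
# the GSM names here are invented for the demonstration.
def extract_prefix(file_list):
    return set(f.rsplit('_', 1)[0] for f in file_list)

features = ['GSM1_features.tsv.gz', 'GSM2_features.tsv.gz']
barcodes = ['GSM1_barcodes.tsv.gz', 'GSM2_barcodes.tsv.gz']
matrices = ['GSM1_matrix.mtx.gz', 'GSM2_matrix.mtx.gz', 'GSM3_matrix.mtx.gz']

common = extract_prefix(features) & extract_prefix(barcodes) & extract_prefix(matrices)
print(sorted(common))  # → ['GSM1', 'GSM2'] — GSM3 has only a matrix, so it is skipped
```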