FoxoTech / methylprep

Python-based preprocessing software for Illumina methylation arrays
MIT License

Methylprep doesn't process mixed data (EPIC + 450k) sets yet #61

Closed marcmaxson closed 3 years ago

marcmaxson commented 4 years ago

Some GEO data sets contain both kinds of sample array data. The current fix requires additional user steps, which could be dealt with programmatically:

How to process a batch of GEO idats using methylprep and pipeline (fastest way)

  1. Use a web browser to find your data set. Note the GSExxxxx ID, such as GSE142512.
  2. Run python -m methylprep download -d GSE142512 -i GSE142512, where -d is the data folder to place files in and -i is the GSE ID.
  3. This example data set contains both 450k and EPIC samples, so I used macOS Finder (or Windows File Explorer) to move the 450k idats into another folder first, because the two array types need to be processed separately. (methylprep and pipeline don't process mixed data sets yet.)
  4. Create a sample sheet for EACH folder of idats: python -m methylprep sample_sheet -d GSE142512 -t Blood --create. The '--create' part is vital; otherwise the command just reads an existing sample sheet. In this example, I am adding the sample_type for all of these samples ('Blood'), as it was not already in the sample sheet meta data. This command uses the sample idat filenames and doesn't necessarily parse all of the _family_meta_data stuff. The basic columns are Sample_Name (added) | GSM_ID | Sentrix_ID | Sentrix_Position (from filename). For the full meta data, use python -m methylprep meta_data -d GSE142512 -i GSE142512 instead.
  5. From here, you can process each folder separately with methylprep process.
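
The manual move in step 3 could be scripted. Below is a minimal sketch, not part of methylprep: it assumes the version 3 IDAT binary layout as I understand it (a field table in the header, with field code 1000 pointing at the probe count) and uses a crude, hypothetical cutoff of 700,000 probes, based only on the counts seen in this thread (about 622k for 450k, about 1.05M for EPIC). The function names are made up for illustration, and other array types are not handled.

```python
import shutil
import struct
from pathlib import Path

NUM_SNPS_READ = 1000  # assumed IDAT field code that points at the probe count


def idat_probe_count(path):
    """Read the probe count from an uncompressed IDAT file (assumed v3 layout)."""
    with open(path, "rb") as fh:
        if fh.read(4) != b"IDAT":
            raise ValueError(f"{path} does not look like an IDAT file")
        (version,) = struct.unpack("<q", fh.read(8))
        if version != 3:
            raise ValueError(f"{path}: unsupported IDAT version {version}")
        (num_fields,) = struct.unpack("<i", fh.read(4))
        offsets = {}
        for _ in range(num_fields):
            # Each table entry: 2-byte field code, 8-byte file offset.
            code, offset = struct.unpack("<Hq", fh.read(10))
            offsets[code] = offset
        fh.seek(offsets[NUM_SNPS_READ])
        (count,) = struct.unpack("<i", fh.read(4))
    return count


def guess_array_type(probe_count):
    """Hypothetical cutoff: only distinguishes 450k from EPIC."""
    return "450k" if probe_count < 700_000 else "epic"


def split_by_array_type(data_dir, dry_run=True):
    """Group top-level *.idat files by guessed array type.

    With dry_run=False, each file is moved into data_dir/<array type>/.
    Returns a dict mapping array type to a list of file names.
    """
    data_dir = Path(data_dir)
    groups = {}
    for idat in sorted(data_dir.glob("*.idat")):
        array_type = guess_array_type(idat_probe_count(idat))
        groups.setdefault(array_type, []).append(idat.name)
        if not dry_run:
            target = data_dir / array_type
            target.mkdir(exist_ok=True)
            shutil.move(str(idat), str(target / idat.name))
    return groups
```

Calling split_by_array_type("GSE142512") first with the default dry_run=True shows the planned grouping without touching any files. Gzipped idats (.idat.gz) would need to be decompressed first.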

marcmaxson commented 3 years ago

Still an issue with some GEO download datasets: ValueError: IDATs with varying number of probes: {1052641, 1051815}

marcmaxson commented 3 years ago

As of methylprep v1.5.2, I thought this was resolved, but it is not:

WARNING:methylprep.models.sigset:These IDATs have varying numbers of probes: [(622399, 3), (1051815, 1)] for these array types: [(<ArrayType.ILLUMINA_450K: '450k'>, 3), (<ArrayType.ILLUMINA_EPIC: 'epic'>, 1)]

WARNING:methylprep.models.sigset:(Processing will drop any probes that are not found across all samples for a given array type.)
...
  File "/Users/mmaxmeister/methylprep/methylprep/models/sigset.py", line 27, in get_array_type
    raise ValueError('IDATs with varying array types')
ValueError: IDATs with varying array types

And this doesn't work yet either:

python -m methylprep -v process -d . --all --no_sample_sheet --batch_size 1

marcmaxson commented 3 years ago

The latest methylprep has a workaround and gives users instructions on how to run each batch of samples separately.
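
For anyone who wants to script the per-batch runs themselves, here is a rough sketch. It is not methylprep's built-in workaround. It assumes the 450k and EPIC idats already sit in separate subfolders of one parent folder, and it only uses the sample_sheet --create and process -d commands shown earlier in this thread. The helper names build_commands and run_all are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(parent_dir):
    """Build methylprep commands for every subfolder that contains idats."""
    commands = []
    for folder in sorted(p for p in Path(parent_dir).iterdir() if p.is_dir()):
        if not any(folder.glob("*.idat")):
            continue  # skip folders without idat files
        # Create a fresh sample sheet for this folder, then process it.
        commands.append([sys.executable, "-m", "methylprep",
                         "sample_sheet", "-d", str(folder), "--create"])
        commands.append([sys.executable, "-m", "methylprep",
                         "process", "-d", str(folder)])
    return commands


def run_all(parent_dir):
    """Run each command in order, stopping at the first failure."""
    for cmd in build_commands(parent_dir):
        subprocess.run(cmd, check=True)
```

Printing the result of build_commands("GSE142512") before calling run_all is a cheap way to check which folders would be processed.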