Closed marcmaxson closed 3 years ago
Still an issue with some GEO download datasets:
ValueError: IDATs with varying number of probes: {1052641, 1051815}
As of methylprep v1.5.2, I thought this was resolved, but it is not:
WARNING:methylprep.models.sigset:These IDATs have varying numbers of probes: [(622399, 3), (1051815, 1)] for these array types: [(<ArrayType.ILLUMINA_450K: '450k'>, 3), (<ArrayType.ILLUMINA_EPIC: 'epic'>, 1)]
WARNING:methylprep.models.sigset:(Processing will drop any probes that are not found across all samples for a given array type.)
...
File "/Users/mmaxmeister/methylprep/methylprep/models/sigset.py", line 27, in get_array_type
raise ValueError('IDATs with varying array types')
ValueError: IDATs with varying array types
And this doesn't work yet either:
python -m methylprep -v process -d . --all --no_sample_sheet --batch_size 1
The latest methylprep has a workaround: it gives the user instructions on how to run each batch of samples separately.
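To illustrate what "each batch of samples separately" could look like if automated, here is a minimal sketch that splits samples into batches by probe count. It is not methylprep's implementation: `group_by_probe_count` and the sample names are hypothetical, and the caller is assumed to already know each sample's probe count. The two counts are taken from the warning above. Probe count is only a stand-in for array type, since the warning implies counts can also vary within one array type.

```python
from collections import defaultdict


def group_by_probe_count(probe_counts):
    """Group sample names by the number of probes in their IDAT files.

    `probe_counts` maps a sample name to its probe count. This is a
    hypothetical input: how the counts are obtained is left to the caller.
    """
    batches = defaultdict(list)
    # Sort so that each batch lists its samples in a stable order.
    for sample, n_probes in sorted(probe_counts.items()):
        batches[n_probes].append(sample)
    return dict(batches)


# Hypothetical sample names; the counts mirror the warning (3 x 450k, 1 x EPIC).
counts = {
    "sample_A": 622399,
    "sample_B": 622399,
    "sample_C": 622399,
    "sample_D": 1051815,
}

batches = group_by_probe_count(counts)
for n_probes, samples in batches.items():
    print(n_probes, samples)
```

Each resulting batch could then be moved into its own folder and processed on its own, which is what the workaround currently asks the user to do by hand.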
Some GEO datasets contain samples from both array types (450k and EPIC). The current fix requires additional user steps, which could be handled programmatically:
How to process a batch of GEO idats using methylprep and pipeline (fastest way)
python -m methylprep download -d GSE142512 -i GSE142512
where -d is the data folder to place files, and -i is the GSE ID.
python -m methylprep sample_sheet -d GSE142512 -t Blood --create
Note that the '--create' part is vital; otherwise it just reads an existing samplesheet. In this example, I am adding the sample_type for all of these samples ('Blood') because it was not already in the samplesheet meta data. This command uses the sample idat filenames, and doesn't necessarily parse all of the _family_meta_data stuff. The basic columns are:
Sample_Name (added) | GSM_ID | Sentrix_ID | Sentrix_Position (from filename)
For the full meta data, use
python -m methylprep meta_data -d GSE142512 -i GSE142512
instead.
methylprep process
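As a rough illustration of how the filename-derived samplesheet columns mentioned above (GSM_ID, Sentrix_ID, Sentrix_Position) might be extracted, here is a minimal sketch. It is not methylprep's code: the filename layout `<GSM_ID>_<Sentrix_ID>_<Sentrix_Position>_<Grn|Red>.idat` is an assumption about how GEO names these files, and `parse_idat_name`, `sample_rows`, and the example filenames are hypothetical.

```python
import re
from pathlib import Path

# Assumed layout: <GSM_ID>_<Sentrix_ID>_<Sentrix_Position>_<Grn|Red>.idat
# (optionally gzipped). This pattern is an assumption, not a methylprep rule.
IDAT_PATTERN = re.compile(
    r"^(?P<GSM_ID>GSM\d+)_(?P<Sentrix_ID>\d+)_"
    r"(?P<Sentrix_Position>R\d{2}C\d{2})_(?:Grn|Red)\.idat(?:\.gz)?$"
)


def parse_idat_name(filename):
    """Return the filename-derived fields as a dict, or None if no match."""
    match = IDAT_PATTERN.match(Path(filename).name)
    return match.groupdict() if match else None


def sample_rows(filenames):
    """Build one row per sample; the Grn and Red files share a single row."""
    rows = {}
    for name in filenames:
        fields = parse_idat_name(name)
        if fields is None:
            continue  # skip files that do not follow the assumed layout
        rows.setdefault(fields["GSM_ID"], fields)
    return [rows[gsm] for gsm in sorted(rows)]


# Hypothetical filenames, for illustration only.
example_files = [
    "GSM0000001_200000000001_R01C01_Grn.idat",
    "GSM0000001_200000000001_R01C01_Red.idat",
    "GSM0000002_200000000001_R02C01_Grn.idat",
    "notes.txt",
]

for row in sample_rows(example_files):
    print(row)
```

Files that do not match the assumed layout are skipped silently, so it would be worth checking the row count against the expected number of samples.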