Open eboileau opened 1 year ago
The lack of clear GEO guidelines (at least for scRNA-seq) leads to a plethora of different submission formats, many without metadata, etc. This is a nightmare... I largely underestimated how messy that would be! I'm just adding a few notes:
[ ] Loupe Cell Browser (.cloupe): export appears to be limited to projections (I need to check), and the binaries (crconverter) that generate .cloupe files do not seem to be open source. Skip for now, e.g. GSE162326.
[ ] Missing filelist.txt and/or mixed formats. There is no automated way of dealing with these; they are just too messy, e.g. all samples mixed in a large TXT file w/o an easy way to match them with GEO SAMPLE entries, or MEX-like format combining multiple samples, with non-standard names, etc. Some examples include GSE175634, GSE168742, GSE133996, GSE165838, GSE121893, etc. We need to wrangle them manually.
[ ] Tarred twice, with uncompressed MEX files that all have the same name. This is annoying, but we can deal with that manually: untar twice, rename files, change filelist.txt, and run wrangling.
[ ] H5 files (or mix of different formats incl. H5). We could possibly do something, reading them using Scanpy read_10x_h5.
[ ] Robj.gz, e.g. GSE183852. Nope.
[ ] Some, like GSE129987, provide matrices in transposed format! It would be easy to just add an option to transpose it.
[ ] We already deal with non-standard gene ids/names e.g. with underscores, but only in case we need to use a lookup with MyGene. We should probably clean the var_names in any case for the portal, and also remove dots in gene ids, if present.
[ ] Build recipes via commit, manual, or GitHub deployment, and add Singularity images to some registry
[ ] Test package on Debian 11 - add CI/tests (integration/regression)!
[ ] Currently, matrices are written as sparse, and we should stick to this; however this is problematic for some of the portal's functionalities (e.g. Multigene Display Viewer - volcano and quadrant plots). We should actually fix that on the portal side.
[ ] wrangling entirely relies on existing GEO series (and output from accession), we could extend the program to deal with non-GEO input, e.g. any bulk data matrix to be processed, etc.
[ ] Some checks are missing for single-cell cluster or cell_type column, as per formatting guidelines.
[ ] Currently, if GEO ID does not exist, or if the program fails to download the files, calls to geo_utils.get_GEO or get_supp_files, etc. fail and accession crashes without logging the error. We need to handle these cases by reporting the error, skipping, and continuing to the next GEO ID, in case we have multiple IDs.
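
For the last item (accession crashing on a missing GEO ID or a failed download), the "report, skip, continue" behaviour could look roughly like the sketch below. The `fetch` callable is a hypothetical stand-in for whatever wraps geo_utils.get_GEO / get_supp_files; their real signatures are not shown in this issue, so nothing here assumes them. The logger name is also just a placeholder.

```python
from __future__ import annotations

import logging
from typing import Callable, Iterable

logger = logging.getLogger("accession")  # placeholder logger name


def process_geo_ids(
    geo_ids: Iterable[str],
    fetch: Callable[[str], object],
) -> tuple[dict[str, object], dict[str, str]]:
    """Call fetch() for each GEO ID; log and skip failures instead of crashing.

    `fetch` is hypothetical: any callable that takes a GEO ID and raises on
    failure (unknown ID, download error, ...).
    """
    results: dict[str, object] = {}
    failures: dict[str, str] = {}
    for geo_id in geo_ids:
        try:
            results[geo_id] = fetch(geo_id)
        except Exception as exc:  # broad on purpose: any failure skips this ID
            logger.error("Skipping %s: %s", geo_id, exc)
            failures[geo_id] = str(exc)
    return results, failures
```

Returning the failures alongside the results makes it easy to print a summary at the end of a multi-ID run, rather than having to dig through the log.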
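
For the gene id item, a minimal sketch of what "remove dots in gene ids" could mean, assuming the dots are version suffixes of the form `ENSG00000141510.12` (this is an assumption; the issue does not say which ids are affected). The function names are hypothetical and not part of the package.

```python
import re

# Assumption: a "dot" in a gene id is a trailing version suffix such as ".12".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def clean_gene_id(gene_id: str) -> str:
    """Strip surrounding whitespace and a trailing '.<digits>' version suffix."""
    return _VERSION_SUFFIX.sub("", gene_id.strip())


def clean_var_names(names):
    """Apply clean_gene_id to every entry, preserving order."""
    return [clean_gene_id(name) for name in names]
```

This should only be applied to ids, not to gene symbols, since some symbols legitimately end in a dot followed by digits.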
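
For the "tarred twice" item, the manual untar-twice step could be scripted along these lines. This is only a sketch using the standard library: it extracts each inner archive into its own directory so that identically named MEX files do not overwrite each other. The `_extracted` directory suffix and the function name are arbitrary choices, and renaming files or updating filelist.txt is left out.

```python
import tarfile
from pathlib import Path


def untar_nested(outer: Path, dest: Path) -> list[Path]:
    """Extract `outer` into `dest`, then extract every inner tar archive
    into a sibling directory named '<archive name>_extracted'.

    Only use this on archives you trust: extractall() writes whatever
    paths the archive contains.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(outer) as tf:
        tf.extractall(dest)

    # Collect inner archives first, so files extracted below are not rescanned.
    inner_archives = sorted(
        p for p in dest.rglob("*") if p.is_file() and tarfile.is_tarfile(p)
    )

    extracted_dirs = []
    for inner in inner_archives:
        target = inner.with_name(inner.name + "_extracted")
        target.mkdir(exist_ok=True)
        with tarfile.open(inner) as tf:
            tf.extractall(target)
        extracted_dirs.append(target)
    return extracted_dirs
```

Each returned directory then holds one sample's files, which can be renamed and listed in filelist.txt before running wrangling.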