ebi-gene-expression-group / scxa-bundle-workflow

Workflow for making bundles for Single-cell Expression Atlas

cell_to_library lines don't seem to be written into the MANIFEST #13

Open pcm32 opened 2 years ago

pcm32 commented 2 years ago

It seems that the MANIFEST files are not getting the cell_to_library lines written, even though the studies do have cell_to_library.txt files within the bundle:

(miniconda3)[host results]$ ls -l */*/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   295035 Dec  3  2021 E-CURD-9/mus_musculus/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray  2020908 Dec  3  2021 E-ENAD-49/arabidopsis_thaliana/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   803332 Dec  3  2021 E-ENAD-51/zea_mays/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   225959 Dec  3  2021 E-ENAD-53/solanum_lycopersicum/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   124829 Dec  3  2021 E-GEOD-130148/homo_sapiens/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   135143 May 30 11:27 E-GEOD-137537/homo_sapiens/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray  1964387 Aug 19  2021 E-GEOD-141273/drosophila_melanogaster/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   424576 Oct 14  2021 E-GEOD-141730/arabidopsis_thaliana/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray  2112572 Dec  3  2021 E-GEOD-150728/homo_sapiens/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray  2088049 Aug 18  2021 E-HCAD-10/homo_sapiens/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray 22494236 Aug 12  2021 E-HCAD-1/homo_sapiens/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   766738 Dec  3  2021 E-HCAD-30/homo_sapiens/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray  2247484 Aug  4  2021 E-HCAD-32/homo_sapiens/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   327955 Dec  6  2021 E-HCAD-9/homo_sapiens/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   182649 Dec  3  2021 E-MTAB-6945/mus_musculus/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   392474 Dec  3  2021 E-MTAB-7142/mus_musculus/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   606703 Aug 19  2021 E-MTAB-8698/drosophila_melanogaster/bundle/filtered_normalised/cell_to_library.txt
-rw-r--r-- 1 fg_atlas_sc microarray   224737 Jul 22 15:35 E-MTAB-8848/mus_musculus/bundle/filtered_normalised/cell_to_library.txt
(miniconda3)[host results]$ grep cell_to */*/bundle/MANIFEST
(miniconda3)[host results]$ grep cell_to E-MTAB-7142/mus_musculus/bundle/MANIFEST
(miniconda3)[host results]$ grep cell_to E-MTAB-8848/mus_musculus/bundle/MANIFEST
(miniconda3)[host results]$

This breaks loading, as we get errors of this type:

Cell types file is present at */atlas-prod/sc_experiments_test/E-GEOD-141273/E-GEOD-141273.cells.txt, but no cell/ library maping is available at */atlas-prod/sc_experiments_test/E-GEOD-141273/cell_to_library.txt - this file is required to map cell metadata to libraries

I don't see any lines in this workflow that imply this entry is being written. Note that E-MTAB-8848 was generated quite recently, and its manifest doesn't include the line either.
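To make the check above repeatable, here is a minimal Python sketch of the same comparison the `ls` and `grep` commands perform: list every bundle that ships `filtered_normalised/cell_to_library.txt` but whose MANIFEST never mentions `cell_to_library`. The function name is hypothetical, and it only does a substring match, so it makes no assumption about the MANIFEST column layout.

```python
from pathlib import Path


def bundles_missing_manifest_entry(results_dir, key="cell_to_library"):
    """Return bundle dirs that contain cell_to_library.txt but whose
    MANIFEST does not mention `key` (hypothetical helper, not workflow code)."""
    missing = []
    # Same layout as the ls command: <experiment>/<species>/bundle/...
    pattern = "*/*/bundle/filtered_normalised/cell_to_library.txt"
    for mapping in sorted(Path(results_dir).glob(pattern)):
        bundle = mapping.parent.parent
        manifest = bundle / "MANIFEST"
        # A missing MANIFEST is treated the same as one without the entry.
        text = manifest.read_text() if manifest.is_file() else ""
        if key not in text:
            missing.append(bundle)
    return missing


if __name__ == "__main__":
    # Run from the results directory, as in the shell session.
    for bundle in bundles_missing_manifest_entry("."):
        print(bundle)
```

Run from the results directory, this should print one bundle path per affected study.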

pcm32 commented 2 years ago

There is more bundle-interacting code at https://github.com/ebi-gene-expression-group/scxa-control-workflow/blob/develop/main.nf#L1237 . While CELL_TO_LIBRARY is mentioned a lot there, I don't see where the cell_to_library lines would be written to the manifest there either.

pcm32 commented 2 years ago

I think the issue is that atlas-prod develop simply interacts directly with the file if it exists:

https://github.com/ebi-gene-expression-group/atlas-prod/blob/develop/exec/import_scxa_experiment.sh#L200

but the feature/anndata_import_tweaks branch asks for the cell_to_library entry in the manifests (and fails):

https://github.com/ebi-gene-expression-group/atlas-prod/blob/feature/anndata_import_tweaks/exec/import_scxa_experiment.sh#L96

as that entry has never lived in the MANIFESTs. So the file doesn't get copied, and then the single-cell condensed SDRF step doesn't find it and fails at https://github.com/ebi-gene-expression-group/experiment_metadata/blob/1afc9fd63cb224f09ce74d48423b8fcdc0f1cb06/single_cell_condensed_sdrf.sh#L100 .
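The difference between the two branches can be sketched as a lookup with a fallback: prefer the manifest entry when it exists, otherwise use the file on disk if present. This is only an illustration in Python, not the logic of `import_scxa_experiment.sh`. The function name and the `manifest_entries` dictionary (description mapped to a path relative to the bundle) are hypothetical.

```python
from pathlib import Path


def resolve_cell_to_library(bundle_dir, manifest_entries):
    """Locate the cell/library mapping for a bundle (hypothetical helper).

    `manifest_entries` is assumed to map MANIFEST descriptions to paths
    relative to the bundle directory.
    """
    bundle = Path(bundle_dir)
    # Manifest-driven lookup, as the feature branch expects.
    listed = manifest_entries.get("cell_to_library")
    if listed:
        candidate = bundle / listed
        if candidate.is_file():
            return candidate
    # Fallback: use the file directly if it exists, as develop does.
    fallback = bundle / "filtered_normalised" / "cell_to_library.txt"
    return fallback if fallback.is_file() else None
```

With a fallback like this, bundles generated before the manifest entry existed would still load, while newer bundles could rely on the manifest alone.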

pcm32 commented 2 years ago

I have partly alleviated this through https://github.com/ebi-gene-expression-group/atlas-prod/pull/246/commits/7edc3126c8b60a0c0bb0ef3cc0388e1497969a31 , but we should add a PR to this repo that makes sure the MANIFEST file also gets the cell_to_library line. Could you please take care of that, @irisdianauy? Thanks!
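As a rough illustration of what such a fix needs to guarantee, the Python sketch below appends a `cell_to_library` line to a bundle's MANIFEST when the file exists and no such line is present yet. It is not the workflow's code. It assumes the MANIFEST is tab-separated with three columns (description, file path relative to the bundle, parameterisation) and that the third column can be left empty, and the function name is hypothetical. Check both assumptions against a real MANIFEST before relying on it.

```python
from pathlib import Path


def ensure_manifest_entry(bundle_dir, key="cell_to_library",
                          rel_path="filtered_normalised/cell_to_library.txt"):
    """Append a MANIFEST line for `rel_path` if missing; return True if added.

    ASSUMPTION: lines look like "<description>\t<file>\t<parameterisation>".
    """
    bundle = Path(bundle_dir)
    manifest = bundle / "MANIFEST"
    # Nothing to register if the bundle has no mapping file.
    if not (bundle / rel_path).is_file():
        return False
    existing = manifest.read_text() if manifest.is_file() else ""
    # Idempotent: skip when a line with this description already exists.
    if any(line.split("\t")[0] == key for line in existing.splitlines()):
        return False
    # Make sure the new entry starts on its own line.
    sep = "" if existing == "" or existing.endswith("\n") else "\n"
    with manifest.open("a") as handle:
        handle.write(f"{sep}{key}\t{rel_path}\t\n")
    return True
```

Because the function is idempotent, it could also be run over existing bundles to backfill the missing lines.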