Completeness and contamination missing from fetch_ncbi.py

EBI-Metagenomics / genomes-catalogue-pipeline

MGnify genome analysis pipeline

Completeness and contamination missing from fetch_ncbi.py #80

Open amardeepranu opened 8 months ago

amardeepranu commented 8 months ago

Are we supposed to compile this data ourselves? fetch_ena.py seems to generate it, but fetch_ncbi.py does not.

amardeepranu commented 8 months ago

Ah, looks like it's done here: https://github.com/EBI-Metagenomics/genomes-pipeline/blob/853487f6dda1420fd8b6b41dd4aff5c8540c7e37/subworkflows/prepare_data.nf#L27-L29

Is there a reason the CHECKM step isn't done for the ENA data as well? Specifically for data that isn't pulled using the fetch_ena.py script?

amardeepranu commented 8 months ago

https://github.com/EBI-Metagenomics/genomes-pipeline/blob/853487f6dda1420fd8b6b41dd4aff5c8540c7e37/workflows/genomes_annotation.nf#L90-L99

I also see here that all NCBI data is ignored. Can I edit this to include NCBI data without issue? Thanks

tgurbich commented 7 months ago

Hi @amardeepranu,

All genomes, regardless of where they were downloaded from, should be passed to the pipeline using the --ena_genomes flag. If any of your genomes were fetched using the fetch_ncbi.py script, you need to run CheckM on them and combine all completeness and contamination results into one file for all ENA and NCBI genomes. Pass the path to this combined file to the pipeline using the --ena_genomes_checkm flag.
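
The combining step described above could be sketched roughly as follows. This is only an illustration, not part of the pipeline: the function name `combine_checkm` is hypothetical, and the three-column CSV layout (`genome,completeness,contamination`) is an assumption that should be checked against the pipeline documentation and against a file produced by fetch_ena.py before use.

```python
import csv

# Assumed column layout of the CheckM summary file; confirm against the
# pipeline documentation before relying on it.
FIELDS = ["genome", "completeness", "contamination"]


def combine_checkm(input_paths, output_path):
    """Merge several CheckM summary CSVs into one CSV.

    Each input must have a header row containing the columns in FIELDS.
    A genome that appears more than once with different values raises
    ValueError; exact duplicates are written only once.
    Returns the number of genomes written.
    """
    rows = {}
    for path in input_paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            missing = set(FIELDS) - set(reader.fieldnames or [])
            if missing:
                raise ValueError(f"{path} is missing columns: {sorted(missing)}")
            for row in reader:
                record = {name: row[name].strip() for name in FIELDS}
                previous = rows.get(record["genome"])
                if previous is not None and previous != record:
                    raise ValueError(
                        f"conflicting values for genome {record['genome']}"
                    )
                rows[record["genome"]] = record

    with open(output_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows.values())
    return len(rows)
```

For example, `combine_checkm(["ena_checkm.csv", "ncbi_checkm.csv"], "combined_checkm.csv")` (file names are placeholders) would produce a single file whose path can then be given to --ena_genomes_checkm.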

We will adjust this in future releases to avoid confusion.