Open bcpd opened 1 year ago
see also same problem in two other pipelines 😭 -
https://github.com/dib-lab/charcoal/issues/235 https://github.com/dib-lab/2022-dominating-set-differential-abundance-example/issues/8#issue-1658017422
Apparently the error is solved if you delete the row with the problematic genome from the {output_folder}/gather/{sample}.gather.csv.gz. Just need to gunzip it, remove the row, and gzip it again.
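A minimal Python sketch of that gunzip / remove row / gzip workaround. It assumes the gather CSV has a `name` column whose first whitespace-separated token is the genome identifier. The function name `drop_genome_rows` and its `name_column` parameter are made up for illustration and are not part of genome-grist.

```python
import csv
import gzip
import os
import tempfile


def drop_genome_rows(csv_gz_path, bad_idents, name_column="name"):
    """Rewrite a gzipped CSV in place without the rows for the given genomes.

    A row is dropped when the first whitespace-separated token of its
    `name_column` value exactly equals one of `bad_idents` (assumption:
    that token is the genome identifier). Returns the number of rows removed.
    """
    bad = set(bad_idents)

    # Read the whole table; gather outputs are small enough for this.
    with gzip.open(csv_gz_path, "rt", newline="") as fp:
        reader = csv.DictReader(fp)
        fieldnames = reader.fieldnames
        rows = list(reader)

    kept = [r for r in rows if r[name_column].split(" ", 1)[0] not in bad]

    # Write to a temporary file next to the original, then swap it in,
    # so an interrupted run does not leave a half-written file behind.
    out_dir = os.path.dirname(os.path.abspath(csv_gz_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".csv.gz", dir=out_dir)
    os.close(fd)
    with gzip.open(tmp_path, "wt", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    os.replace(tmp_path, csv_gz_path)

    return len(rows) - len(kept)
```

Matching on the whole identifier token rather than on a prefix avoids accidentally removing a different genome whose accession merely starts with the same characters.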
yep. however, that then has the effect of ignoring any other genome (or genomes) that would have been chosen in the absence of the problematic genome. e.g. if there's a specific E. coli genome that is no longer available, by removing it from the gather output, you are probably eliminating all the E. coli genomes.
the fix I have in mind (but need to find time to implement robustly, and test) would exclude the specific problematic genome from the search, while allowing other related genomes that are NOT problematic to be included.
I believe you could mimic that here by removing the problematic genome from the prefetch file, rather than the gather file.
I have come up with a solution to ignore the problematic genomes.
genome-grist run {grist_config.yaml} gather_reads
Usage:
python repair_grist_gather_files --grist_output_folder <grist_folder>
genome-grist run grist_config.yaml summarize_gather summarize_mapping
An advantage is that we do not need to specify the ignore genomes parameter in the configuration; it will never run into this problem when downloading them. The process takes a few minutes, as it needs to check each prefetch genome separately.
My assumption is that if a genome is present in the prefetch list, there is likely another closely related genome also in the list, so even if we don't get the best match, we will have a decent match. This assumption probably holds better when using the full database and not a dereplicated one.
I have used the same python libraries that grist scripts use so there should not be a major compatibility issue.
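The per-genome check described above can be sketched roughly as follows. This is not the actual repair script: `split_by_availability` is a hypothetical helper, and `is_available` is a stand-in for whatever check is really used (for example, trying to look up or download the genome), supplied by the caller.

```python
def split_by_availability(idents, is_available):
    """Partition genome identifiers into (available, missing).

    `is_available` is a caller-supplied function taking one identifier and
    returning True/False. A check that raises an exception is treated as
    "missing", so one bad genome does not abort the whole pass.
    """
    available, missing = [], []
    for ident in idents:
        try:
            ok = bool(is_available(ident))
        except Exception:
            ok = False
        (available if ok else missing).append(ident)
    return available, missing
```

Only the identifiers in `missing` would then be dropped from the prefetch list, so closely related genomes that are still available remain candidates for the search.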
Re-opening the issue below as a new issue. I am having the same issue. Help would be greatly appreciated.
These rules correctly ignore the missing genome specified in the yaml:
The first rule that creates an error is extract_leftover_reads_wc. I checked its code, and it takes the gather_csv file as input but does not check for the flagged genomes in the python script subtract_gather.py.
These other rules also use that csv as input: make_gather_notebook_wc (uses papermill and report-gather.ipynb) and make_mapping_notebook_wc (uses papermill and report-mapping.ipynb).
A possible solution would be to pass as an argument the list of flagged genomes (IGNORE_IDENTS) to the python script when it is loading the list of genomes from the csv
Line 29:
I don't know enough about python notebooks to suggest a solution there.
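For the python script part of the suggestion (passing the flagged genomes as an argument and skipping them while loading the csv), a rough sketch could look like this. It is not the actual genome-grist code: the `--ignore-ident` flag, the function names, and the assumption that the identifier is the first token of the `name` column are all hypothetical.

```python
import argparse
import csv
import gzip


def load_gather_idents(gather_csv, ignore_idents=()):
    """Return genome identifiers from a gather CSV, skipping flagged ones.

    Assumption: the identifier is the first whitespace-separated token of
    the `name` column. Both plain and gzipped CSV files are accepted.
    """
    ignore = set(ignore_idents)
    opener = gzip.open if gather_csv.endswith(".gz") else open
    idents = []
    with opener(gather_csv, "rt", newline="") as fp:
        for row in csv.DictReader(fp):
            ident = row["name"].split(" ", 1)[0]
            if ident in ignore:
                continue  # flagged genome: leave it out of the genome list
            idents.append(ident)
    return idents


def parse_args(argv=None):
    """Parse the command line; --ignore-ident may be given multiple times."""
    p = argparse.ArgumentParser()
    p.add_argument("gather_csv")
    p.add_argument(
        "--ignore-ident",
        action="append",
        default=[],
        help="genome identifier to skip (hypothetical flag; repeatable)",
    )
    return p.parse_args(argv)
```

The rule could then pass each entry of IGNORE_IDENTS as a separate `--ignore-ident` value, and the script would call `load_gather_idents(args.gather_csv, args.ignore_ident)`.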
Originally posted by @carden24 in https://github.com/dib-lab/genome-grist/issues/241#issuecomment-1496898984