dib-lab / genome-grist

map Illumina metagenomes to genomes!
https://dib-lab.github.io/genome-grist/

add check to ignore genome(s) that cannot be downloaded #277

Open bcpd opened 1 year ago

bcpd commented 1 year ago

Re-opening the issue below as a new issue. I am having the same issue. Help would be greatly appreciated.

I am still having issues with this error. I think that there are still some rules that need to incorporate a check to ignore genomes that cannot be downloaded.

These rules correctly ignore the missing genome specified in the yaml:

download_matching_genome
make_genbank_info_csv
bam_to_depth_wc
minimap_wc
samtools_mpileup_wc
samtools_count_wc
bam_to_fastq_wc

The first rule that produces an error is extract_leftover_reads_wc. I checked its code: it uses the gather CSV file as input, but the python script substract_gather.py does not check for the flagged genomes.

   input:
        csv = f'{outdir}/gather/{{sample}}.gather.csv.gz',
        mapped = Checkpoint_GatherResults(f"{outdir}/mapping/{{sample}}.x.{{ident}}.mapped.fq.gz"),

These other rules also use that csv as input:

make_gather_notebook_wc -> uses papermill and report-gather.ipynb
make_mapping_notebook_wc -> uses papermill and report-mapping.ipynb

A possible solution would be to pass the list of flagged genomes (IGNORE_IDENTS) as an argument to the python script when it loads the list of genomes from the csv.

Line 29:

    with gzip.open(args.gather_csv, "rt") as fp:
        r = csv.DictReader(fp)
        for row in r:
            rows.append(row)
    print(f"...loaded {len(rows)} results total.")

    print("checking input/output pairs:")
    pairs = []
    fail = False
    for row in rows:
        acc = row["name"].split()[0]
        if acc in IGNORE_IDENTS:   # new: skip genomes flagged as un-downloadable
            print(f"Ignoring {acc}")
            continue

        filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
        overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
        leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"
        if not os.path.exists(filename):
            print(f"ERROR: input filename {filename} does not exist. Will exit.")
            fail = True
        pairs.append((acc, filename, overlapping, leftover))
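
One way to get IGNORE_IDENTS into the script would be an extra command-line argument. A minimal sketch, assuming a hypothetical --ignore-idents flag (this is not an existing genome-grist option, and the rest of the script's argument parsing is elided):

    import argparse

    p = argparse.ArgumentParser()
    p.add_argument("gather_csv")
    # hypothetical flag; the Snakemake rule would fill it from the config's
    # list of flagged accessions
    p.add_argument("--ignore-idents", nargs="*", default=[],
                   help="accessions to skip, e.g. genomes that cannot be downloaded")
    args = p.parse_args()

    IGNORE_IDENTS = set(args.ignore_idents)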

I don't know enough about python notebooks to suggest a solution there.

Originally posted by @carden24 in https://github.com/dib-lab/genome-grist/issues/241#issuecomment-1496898984

ctb commented 1 year ago

see also same problem in two other pipelines 😭 -

https://github.com/dib-lab/charcoal/issues/235
https://github.com/dib-lab/2022-dominating-set-differential-abundance-example/issues/8#issue-1658017422

carden24 commented 1 year ago

Apparently the error is solved if you delete the row with the problematic genome from the {output_folder}/gather/{sample}.gather.csv.gz. Just need to gunzip it, remove the row, and gzip it again.
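
For example, the same edit can be scripted with the gzip/csv modules that grist itself uses. A sketch only; the accession and path below are made-up placeholders:

    import csv
    import gzip

    bad_acc = "GCA_000000000.1"   # placeholder: the problematic accession
    path = "outputs/gather/SAMPLE.gather.csv.gz"   # adjust folder and sample name

    with gzip.open(path, "rt") as fp:
        r = csv.DictReader(fp)
        rows = list(r)
        fieldnames = r.fieldnames

    # keep every row whose genome name does not start with the bad accession
    kept = [row for row in rows if not row["name"].startswith(bad_acc)]

    with gzip.open(path, "wt", newline="") as fp:
        w = csv.DictWriter(fp, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(kept)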

ctb commented 1 year ago

yep. however, that then has the effect of ignoring any other genome (or genomes) that would have been chosen in the absence of the problematic genome. e.g. if there's a specific E. coli genome that is no longer available, by removing it from the gather output, you are probably eliminating all the E. coli genomes.

the fix I have in mind (but need to find time to implement robustly, and test) would exclude the specific problematic genome from the search, while allowing other related genomes that are NOT problematic to be included.

I believe you could mimic that here by removing the problematic genome from the prefetch file, rather than the gather file.

carden24 commented 1 year ago

I have come up with a solution to ignore the problematic genomes.

1. I run grist asking for an intermediate target, instead of the full pipeline. This creates the signatures and a list of prefetched genomes.

       genome-grist run {grist_config.yaml} gather_reads

2. I run a script that parses the prefetch files, looks at the accession ids listed there, and checks whether each genome is available at the FTP site. If a genome is not available, the script removes its corresponding row from the prefetch files (see the sketch after this list). repair_grist_gather_files.zip

       Usage: python repair_grist_gather_files --grist_output_folder <grist_folder>

3. I run grist again with the full target:

       genome-grist run grist_config.yaml summarize_gather summarize_mapping
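
For reference, here is a minimal sketch of the kind of check step 2 describes. This is not the attached script; the prefetch column name match_name, the in-place rewrite of the CSV, and the NCBI genomes/all directory layout are assumptions:

    import csv
    import gzip
    import urllib.error
    import urllib.request

    def genbank_dir_url(acc):
        # GCA_000123456.1 -> https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/123/456/
        prefix, number = acc.split("_")
        number = number.split(".")[0]
        parts = [number[i:i + 3] for i in range(0, 9, 3)]
        return f"https://ftp.ncbi.nlm.nih.gov/genomes/all/{prefix}/{'/'.join(parts)}/"

    def is_available(acc):
        # the assembly folder name starts with the versioned accession,
        # so it should appear in the directory listing if the genome still exists
        try:
            with urllib.request.urlopen(genbank_dir_url(acc)) as resp:
                listing = resp.read().decode("utf-8", errors="replace")
            return acc in listing
        except urllib.error.HTTPError:
            return False

    def filter_prefetch(path):
        # drop prefetch rows whose accession no longer resolves on the FTP site
        with gzip.open(path, "rt") as fp:
            r = csv.DictReader(fp)
            rows = list(r)
            fieldnames = r.fieldnames
        kept = [row for row in rows
                if is_available(row["match_name"].split()[0])]
        with gzip.open(path, "wt", newline="") as fp:
            w = csv.DictWriter(fp, fieldnames=fieldnames)
            w.writeheader()
            w.writerows(kept)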

An advantage is that we do not need to specify the ignore genomes parameter in the configuration; grist will never run into this problem when downloading the genomes. The process takes a few minutes, as it needs to check each prefetched genome separately.

My assumption is that if a problematic genome is present in the prefetch list, there is likely another closely related genome also in the list, so even if we don't get the best match, we will still have a decent one. This assumption probably holds better when using the full database rather than a dereplicated one.

I have used the same python libraries that the grist scripts use, so there should not be any major compatibility issues.