EichlerLab / pav

Phased assembly variant caller
92 stars 7 forks source link

Error in rule call_cigar #63

Open GuillaumeHolley opened 4 days ago

GuillaumeHolley commented 4 days ago

Hi,

So I recently pulled the Singularity image of PAV 2.4.2.1 which I have used in parallel on a "large" number of assemblies. All the jobs eventually failed after some time so I restarted it on a single assembly which eventually failed too. For some reasons, the errors only got printed in the terminal but are not included in the Snakemake log file so I am copying them here

RuleException:
KeyError in file /opt/pav/rules/call.snakefile, line 810:
"sequence 'H2#contig_6639' not present"
  File "/opt/pav/rules/call.snakefile", line 810, in __rule_call_cigar
  File "/opt/pav/pavlib/cigarcall.py", line 66, in make_insdel_snv_calls
  File "pysam/libcfaidx.pyx", line 301, in pysam.libcfaidx.FastaFile.fetch
[Mon Oct  7 10:17:49 2024]
Error in rule call_cigar:
    jobid: 0
    input: results/XXXXXXX/align/trim-none/aligned_tig_h2.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h2.bed.gz, temp/XXXXXXX/align/contigs_h2.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h2_6.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h2_6.gz

Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.
[E::bgzf_index_load_hfile] Error reading temp/XXXXXXX/align/contigs_h2.fa.gz.gzi : Success
[E::fai_load3_core] Failed to load .gzi index: temp/XXXXXXX/align/contigs_h2.fa.gz.gzi
RuleException:
OSError in file /opt/pav/rules/call.snakefile, line 810:
error when opening file `temp/XXXXXXX/align/contigs_h2.fa.gz`
  File "/opt/pav/rules/call.snakefile", line 810, in __rule_call_cigar
  File "/opt/pav/pavlib/cigarcall.py", line 64, in make_insdel_snv_calls
  File "pysam/libcfaidx.pyx", line 121, in pysam.libcfaidx.FastaFile.__cinit__
  File "pysam/libcfaidx.pyx", line 181, in pysam.libcfaidx.FastaFile._open
[Mon Oct  7 10:17:49 2024]
Error in rule call_cigar:
    jobid: 0
    input: results/XXXXXXX/align/trim-none/aligned_tig_h2.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h2.bed.gz, temp/XXXXXXX/align/contigs_h2.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h2_2.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h2_2.gz

Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.
RuleException:
KeyError in file /opt/pav/rules/call.snakefile, line 810:
"sequence 'H2#contig_3965' not present"
  File "/opt/pav/rules/call.snakefile", line 810, in __rule_call_cigar
  File "/opt/pav/pavlib/cigarcall.py", line 66, in make_insdel_snv_calls
  File "pysam/libcfaidx.pyx", line 301, in pysam.libcfaidx.FastaFile.fetch
[Mon Oct  7 10:17:49 2024]
Error in rule call_cigar:
    jobid: 0
    input: results/XXXXXXX/align/trim-none/aligned_tig_h2.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h2.bed.gz, temp/XXXXXXX/align/contigs_h2.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h2_1.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h2_1.gz

Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.
[E::bgzf_index_load_hfile] Error reading temp/XXXXXXX/align/contigs_h2.fa.gz.gzi : No such file or directory
[E::fai_load3_core] Failed to load .gzi index: temp/XXXXXXX/align/contigs_h2.fa.gz.gzi
RuleException:
OSError in file /opt/pav/rules/call.snakefile, line 810:
error when opening file `temp/XXXXXXX/align/contigs_h2.fa.gz`
  File "/opt/pav/rules/call.snakefile", line 810, in __rule_call_cigar
  File "/opt/pav/pavlib/cigarcall.py", line 64, in make_insdel_snv_calls
  File "pysam/libcfaidx.pyx", line 121, in pysam.libcfaidx.FastaFile.__cinit__
  File "pysam/libcfaidx.pyx", line 181, in pysam.libcfaidx.FastaFile._open
[Mon Oct  7 10:17:49 2024]
Error in rule call_cigar:
    jobid: 0
    input: results/XXXXXXX/align/trim-none/aligned_tig_h2.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h2.bed.gz, temp/XXXXXXX/align/contigs_h2.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h2_4.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h2_4.gz

Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.
[Mon Oct  7 10:17:49 2024]
Error in rule call_cigar:
    jobid: 119
    input: results/XXXXXXX/align/trim-none/aligned_tig_h2.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h2.bed.gz, temp/XXXXXXX/align/contigs_h2.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h2_6.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h2_6.gz

[Mon Oct  7 10:17:49 2024]
Error in rule call_cigar:
    jobid: 115
    input: results/XXXXXXX/align/trim-none/aligned_tig_h2.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h2.bed.gz, temp/XXXXXXX/align/contigs_h2.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h2_2.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h2_2.gz

[Mon Oct  7 10:17:49 2024]
Error in rule call_cigar:
    jobid: 114
    input: results/XXXXXXX/align/trim-none/aligned_tig_h2.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h2.bed.gz, temp/XXXXXXX/align/contigs_h2.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h2_1.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h2_1.gz

[Mon Oct  7 10:17:49 2024]
Error in rule call_cigar:
    jobid: 117
    input: results/XXXXXXX/align/trim-none/aligned_tig_h2.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h2.bed.gz, temp/XXXXXXX/align/contigs_h2.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h2_4.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h2_4.gz

[E::fai_read] Could not understand FASTA index temp/XXXXXXX/align/contigs_h1.fa.gz.fai line 870
[E::fai_load3_core] Failed to read FASTA index temp/XXXXXXX/align/contigs_h1.fa.gz.fai
RuleException:
OSError in file /opt/pav/rules/call.snakefile, line 810:
error when opening file `temp/XXXXXXX/align/contigs_h1.fa.gz`
  File "/opt/pav/rules/call.snakefile", line 810, in __rule_call_cigar
  File "/opt/pav/pavlib/cigarcall.py", line 64, in make_insdel_snv_calls
  File "pysam/libcfaidx.pyx", line 121, in pysam.libcfaidx.FastaFile.__cinit__
  File "pysam/libcfaidx.pyx", line 181, in pysam.libcfaidx.FastaFile._open
[Mon Oct  7 10:17:51 2024]
Error in rule call_cigar:
    jobid: 0
    input: results/XXXXXXX/align/trim-none/aligned_tig_h1.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h1.bed.gz, temp/XXXXXXX/align/contigs_h1.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h1_3.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h1_3.gz

Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.
RuleException:
KeyError in file /opt/pav/rules/call.snakefile, line 810:
"sequence 'H1#contig_6041' not present"
  File "/opt/pav/rules/call.snakefile", line 810, in __rule_call_cigar
  File "/opt/pav/pavlib/cigarcall.py", line 66, in make_insdel_snv_calls
  File "pysam/libcfaidx.pyx", line 301, in pysam.libcfaidx.FastaFile.fetch
[Mon Oct  7 10:17:51 2024]
Error in rule call_cigar:
    jobid: 0
    input: results/XXXXXXX/align/trim-none/aligned_tig_h1.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h1.bed.gz, temp/XXXXXXX/align/contigs_h1.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h1_0.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h1_0.gz

RuleException:
KeyError in file /opt/pav/rules/call.snakefile, line 810:
"sequence 'H1#contig_91' not present"
  File "/opt/pav/rules/call.snakefile", line 810, in __rule_call_cigar
  File "/opt/pav/pavlib/cigarcall.py", line 66, in make_insdel_snv_calls
  File "pysam/libcfaidx.pyx", line 301, in pysam.libcfaidx.FastaFile.fetch
[Mon Oct  7 10:17:51 2024]
Error in rule call_cigar:
    jobid: 0
    input: results/XXXXXXX/align/trim-none/aligned_tig_h1.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h1.bed.gz, temp/XXXXXXX/align/contigs_h1.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h1_6.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h1_6.gz

Exiting because a job execution failed. Look above for error message
Exiting because a job execution failed. Look above for error message
WorkflowError:
At least one job did not complete successfully.
WorkflowError:
At least one job did not complete successfully.
[Mon Oct  7 10:17:52 2024]
[Mon Oct  7 10:17:52 2024]
Error in rule call_cigar:
    jobid: 17
    input: results/XXXXXXX/align/trim-none/aligned_tig_h1.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h1.bed.gz, temp/XXXXXXX/align/contigs_h1.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h1_3.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h1_3.gz

Error in rule call_cigar:
    jobid: 9
    input: results/XXXXXXX/align/trim-none/aligned_tig_h1.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h1.bed.gz, temp/XXXXXXX/align/contigs_h1.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h1_0.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h1_0.gz

[Mon Oct  7 10:17:52 2024]
Error in rule call_cigar:
    jobid: 20
    input: results/XXXXXXX/align/trim-none/aligned_tig_h1.bed.gz, results/XXXXXXX/align/trim-tigref/aligned_tig_h1.bed.gz, temp/XXXXXXX/align/contigs_h1.fa.gz
    output: temp/XXXXXXX/cigar/batched/insdel_h1_6.bed.gz, temp/XXXXXXX/cigar/batched/snv.bed_h1_6.gz

I am also enclosing the Snakemake log file if that is any useful. I would appreciate any feedback on this issue.

Thanks, Guillaume snakemake.log

paudano commented 4 days ago

It looks like the assembly FASTA records were altered, was "H2" added to "H2#contig_6639" after assembly? If so, can you re-index the FASTA files (samtools faidx)? It looks like the record names in the '.fai' file don't match the FASTA file, and I think this is why it's failing to read the FASTA file.

PAV only requires contig names to be unique per assembled haplotype (i.e. "contig_0001" may appear in both haplotypes 1 and 2). It's not necessary to add the haplotype to contigs for PAV, but perfectly fine to do that if you need it, just re-index FASTA files before calling PAV. From PAV calls, it's possible to trace each variant back to the correct assembly haplotype and contig coordinates without ambiguity.

GuillaumeHolley commented 4 days ago

Hi @paudano,

Thanks for getting back to me so quickly!

So the input files I have provided to PAV are not indexed (there are no .fai files with the input FASTA). I also double-checked and H2#contig_6639 is a unique record in the second haplotype assembly FASTA, no other record exist with that name (the record also does not occur in the first haplotype assembly). The assembly pipeline we are using automatically prefixes each haplotig with the haplotype identifier it is from. It was actually made this way for downstream tools/pipelines processing the 2 haplotype assemblies at once and which, unlike PAV, would not be able to distinguish 2 haplotigs with the same name but from different FASTA files.

paudano commented 4 days ago

I need to see where the .fai and .gzi files are coming from. Can you share the output from running this?

file temp/XXXXXXX/align/contigs_h2.fa.gz*
GuillaumeHolley commented 4 days ago

I mistakenly erased the directory so I'm gonna restart the computation and report on the file you need info on.

In the meantime, I have been rerunning the exact same command I used but with an older version of PAV (2.2.4.1) that I had downloaded some time ago and it has finished without errors.

GuillaumeHolley commented 3 days ago

I restarted the job and it failed at the same step. Here is the output of file temp/XXXXXXX/align/contigs_h2.fa.gz*:

temp/XXXXXXX/align/contigs_h2.fa.gz:     Blocked GNU Zip Format (BGZF; gzip compatible), block length 19953
temp/XXXXXXX/align/contigs_h2.fa.gz.fai: ASCII text
temp/XXXXXXX/align/contigs_h2.fa.gz.gzi: data

Similar for contigs_h1. Both index files .fai seem normal.

paudano commented 3 days ago

If the indexes are present, PAV links to the existing indexes, otherwise, they are generated with samtools faidx (samtools 1.19 is in the container). For yours, PAV has generated the indexes. It's gotten through alignments, so minimap2 had no problems reading the FASTA, but it doesn't need the index since it reads all sequences. It's not until PAV tries to open the file for random access with pysam that it f ails (pysam 0.22.1 is in the container).

I have tried a bunch of ways to replicate the issue, but so far, I cannot. I created a small FASTA file with the exact contig, but so far, opening the FASTA and retrieving the sequence never fails even when using the container directly to do it. I've tried adding spaces before and after the name. I thought the "#" in the sequence name might be a problem, but I can't find anything that's not handling it correctly.

My best guess so far is when the contigs were renamed, spaces, special characters, or a mix of Windows (CR-LF) and Linux (LF) line breaks could have been introduced, but I can't be sure this is it. I'd zcat a few records to a plain-text file and look at it with hexedit to see if there's anything odd around the record IDs. If this is the case, it's possible one tool could read through these (minimap2) while another fails (pysam).

Are you able to share an assembly so that I can run and troubleshoot it? I can setup a secure Globus transfer. If not, I can try renaming contigs on one of my own samples to match yours and see if the error turns up in a whole PAV run, but I doubt this would reproduce the error based on my tests so far.

GuillaumeHolley commented 2 days ago

Hi @paudano,

I appreciate the work you've put in trying to reproduce the issue. I also did try a few things. At first, I thought maybe the fact that those assemblies are multi-lines FASTA was an issue so I used seqtk to switch to single-line FASTA. Same result. I did try to index the input files with samtools beforehand. Same result. Interestingly, sometimes it complains about not being able to find the needed contig in the .fai file and sometimes in the .gzi file.

Now I tried yesterday to use a different machine to run the Singularity image (we never know) and while the outcome was the same, the first error message from pysam was Error reading temp/XXXXXXX/align/contigs_h1.fa.gz.gzi: No such file or directory. There were other related errors on previous runs saying Could not understand FASTA index. It would make sense that if the index file does not exist when the job gets started, it cannot find the required contigs in it. Yet, I checked the index file and it is where it is supposed to be. Is it possible the issue is related to some job dependency problem in Snakemake? Like some job using the index runs while either the index is being written or deleted?

Unfortunately, those assemblies are sensitive data. I can try to get approval to share them but I am not too hopeful (and it might take a while).

GuillaumeHolley commented 2 days ago

I tried the PAV 2.3.4 Singularity image and it ran smoothly.

paudano commented 2 days ago

In 2.3.4, PAV always re-writes the input FASTA files (stores a whole copy of it in the run directory). Later versions of PAV link to FASTAs if it can. I think this is is the reason it's working in 2.3.4.

I just released 2.4.2. This version adds parameter "no_link_qry" (config.json); set to "True" to force-rebuild the FASTA like the old version. I'd recommend using the newer version if you can. If you try this, let me know if it solves the issue.

PAV 3 is being developed with some major improvement. Sounds like this is a pretty big project, and if I can get a beta version of PAV 3 out quickly enough, would you be interested in trying it?

GuillaumeHolley commented 2 days ago

Thanks. I pulled the 2.4.2 image and no matter whether I use the no_link_qry parameter or not in the config.json, I get the following error right at the beginning before anything starts:

Config file config.json is extended by additional config specified via the command line.
Assuming unrestricted shared filesystem usage.
host: XXXXXXX
Building DAG of jobs...
MissingInputException in rule call_integrate_sources in file /opt/pav/rules/call.snakefile, line 492:
Missing input files for rule call_integrate_sources:
    output: results/XXXXXXX/bed_hap/pass/h1/svindel_ins.bed.gz, results/XXXXXXX/bed_hap/pass/h1/svindel_del.bed.gz, results/XXXXXXX/bed_hap/pass/h1/sv_inv.bed.gz, results/XXXXXXX/bed_hap/pass/h1/snv_snv.bed.gz, results/XXXXXXX/bed_hap/pass/h1/fa/svindel_ins.fa.gz, results/XXXXXXX/bed_hap/pass/h1/fa/svindel_del.fa.gz, results/XXXXXXX/bed_hap/pass/h1/fa/sv_inv.fa.gz, temp/XXXXXXX/bed_hap/fail/h1/svindel_ins.bed.gz, temp/XXXXXXX/bed_hap/fail/h1/svindel_del.bed.gz, temp/XXXXXXX/bed_hap/fail/h1/sv_inv.bed.gz, temp/XXXXXXX/bed_hap/fail/h1/snv_snv.bed.gz, temp/XXXXXXX/bed_hap/fail/h1/fa/svindel_ins.fa.gz, temp/XXXXXXX/bed_hap/fail/h1/fa/svindel_del.fa.gz, temp/XXXXXXX/bed_hap/fail/h1/fa/sv_inv.fa.gz
    wildcards: asm_name=XXXXXXX, hap=h1
    affected files:
        temp/XXXXXXX/lg_sv/sv_del_h1.bed.gz
        temp/XXXXXXX/lg_sv/sv_ins_h1.bed.gz
        results/XXXXXXX/align/trim-tig/depth_tig_h1.bed.gz

I would definitely be interested in trying out PAV3 beta. We have 300+ assemblies.

paudano commented 2 days ago

Ugh, I built the wrong branch into the containers. Fixing now...

paudano commented 2 days ago

I just pushed 2.4.3. Thanks for reporting these issues.

GuillaumeHolley commented 1 day ago

So I just tried the 2.4.3 version with "no_link_qry" set to "True" in config.json. Same errors as usual at the same location :(

It is still trying to read the index files from "temp/XXXXXXX/align/contigs_h1.fa.gz.gzi" or "temp/XXXXXXX/align/contigs_h1.fa.gz.fai" (same for h2) and still fails to do so. The error is always Error reading temp/XXXXXXX/align/contigs_h1.fa.gz.gzi: No such file or directory or Could not understand FASTA index.