GuillaumeHolley opened this issue 1 month ago
It looks like the assembly FASTA records were altered; was "H2" added to "H2#contig_6639" after assembly? If so, can you re-index the FASTA files (`samtools faidx`)? It looks like the record names in the `.fai` file don't match the FASTA file, and I think this is why it's failing to read the FASTA file.
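For reference, re-indexing would look something like this (file names are placeholders for your bgzip-compressed assembly FASTAs; removing any stale indexes first makes sure both the `.fai` and `.gzi` are rebuilt):

```bash
# Drop any stale indexes so they are rebuilt from the current records
rm -f assembly_h1.fa.gz.fai assembly_h1.fa.gz.gzi
rm -f assembly_h2.fa.gz.fai assembly_h2.fa.gz.gzi

# Re-index the bgzip-compressed FASTAs (writes a .fai and a .gzi next to each file)
samtools faidx assembly_h1.fa.gz
samtools faidx assembly_h2.fa.gz
```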
PAV only requires contig names to be unique per assembled haplotype (i.e. "contig_0001" may appear in both haplotypes 1 and 2). It's not necessary to add the haplotype to contig names for PAV, but it's perfectly fine to do so if you need it; just re-index the FASTA files before calling PAV. From PAV calls, it's possible to trace each variant back to the correct assembly haplotype and contig coordinates without ambiguity.
Hi @paudano,
Thanks for getting back to me so quickly!
So the input files I have provided to PAV are not indexed (there are no `.fai` files with the input FASTA). I also double-checked: `H2#contig_6639` is a unique record in the second haplotype assembly FASTA, and no other record exists with that name (the record also does not occur in the first haplotype assembly). The assembly pipeline we are using automatically prefixes each haplotig with the identifier of the haplotype it comes from. It was actually made this way for downstream tools/pipelines that process the two haplotype assemblies at once and which, unlike PAV, would not be able to distinguish two haplotigs with the same name coming from different FASTA files.
I need to see where the `.fai` and `.gzi` files are coming from. Can you share the output from running this?
file temp/XXXXXXX/align/contigs_h2.fa.gz*
I mistakenly erased the directory, so I'm going to restart the computation and report back with the info on that file.
In the meantime, I have been rerunning the exact same command I used, but with an older version of PAV (2.2.4.1) that I had downloaded some time ago, and it has finished without errors.
I restarted the job and it failed at the same step. Here is the output of `file temp/XXXXXXX/align/contigs_h2.fa.gz*`:
temp/XXXXXXX/align/contigs_h2.fa.gz: Blocked GNU Zip Format (BGZF; gzip compatible), block length 19953
temp/XXXXXXX/align/contigs_h2.fa.gz.fai: ASCII text
temp/XXXXXXX/align/contigs_h2.fa.gz.gzi: data
The output is similar for contigs_h1, and both `.fai` index files look normal.
If the indexes are present, PAV links to the existing indexes; otherwise, they are generated with `samtools faidx` (samtools 1.19 is in the container). For yours, PAV has generated the indexes. It got through the alignments, so minimap2 had no problems reading the FASTA, but minimap2 doesn't need the index since it reads all sequences. It's not until PAV tries to open the file for random access with pysam that it fails (pysam 0.22.1 is in the container).
I have tried a bunch of ways to replicate the issue, but so far, I cannot. I created a small FASTA file with the exact contig, but so far, opening the FASTA and retrieving the sequence never fails even when using the container directly to do it. I've tried adding spaces before and after the name. I thought the "#" in the sequence name might be a problem, but I can't find anything that's not handling it correctly.
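In case it's useful for your own testing, the failing step more or less boils down to a pysam random-access fetch; something like this (the path and contig name are taken from your run, and this is only a sketch of the access pattern, not PAV's actual code):

```bash
# Open the bgzipped FASTA for random access with pysam and fetch one contig
python3 -c "
import pysam
fa = pysam.FastaFile('temp/XXXXXXX/align/contigs_h2.fa.gz')
print(fa.fetch('H2#contig_6639')[:60])
"
```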
My best guess so far is that when the contigs were renamed, spaces, special characters, or a mix of Windows (CR-LF) and Linux (LF) line breaks could have been introduced, but I can't be sure this is it. I'd zcat a few records to a plain-text file and look at it with hexedit to see if there's anything odd around the record IDs. If that's the case, it's possible one tool could read through these (minimap2) while another fails (pysam).
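For example, something along these lines would expose hidden carriage returns or stray bytes around a record ID (file name is a placeholder; `hexdump -C` is a non-interactive alternative to hexedit):

```bash
# A non-zero count here means Windows (CR-LF) line endings are present
zcat contigs_h2.fa.gz | grep -c $'\r'

# Show the raw bytes of the suspect header line and the start of its sequence
zcat contigs_h2.fa.gz | grep -m1 -A1 '^>H2#contig_6639' | hexdump -C | head
```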
Are you able to share an assembly so that I can run and troubleshoot it? I can set up a secure Globus transfer. If not, I can try renaming contigs on one of my own samples to match yours and see if the error turns up in a whole PAV run, but I doubt this would reproduce the error based on my tests so far.
Hi @paudano,
I appreciate the work you've put into trying to reproduce the issue. I also tried a few things. At first, I thought the fact that those assemblies are multi-line FASTA might be an issue, so I used seqtk to convert them to single-line FASTA. Same result. I also tried indexing the input files with samtools beforehand. Same result. Interestingly, sometimes it complains about not being able to find the needed contig in the `.fai` file and sometimes in the `.gzi` file.
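For reference, the conversion and pre-indexing I tried were roughly this (file names here are placeholders for the actual assemblies):

```bash
# Unwrap to single-line FASTA (-l0 disables line wrapping), then re-compress
# with bgzip so the file can be indexed
seqtk seq -l0 assembly_h2.fa.gz | bgzip > assembly_h2.oneline.fa.gz

# Pre-build the .fai and .gzi indexes before handing the file to PAV
samtools faidx assembly_h2.oneline.fa.gz
```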
Yesterday I tried running the Singularity image on a different machine (you never know), and while the outcome was the same, the first error message from pysam was `Error reading temp/XXXXXXX/align/contigs_h1.fa.gz.gzi: No such file or directory`. There were other related errors on previous runs saying `Could not understand FASTA index`. It would make sense that if the index file does not exist when the job starts, it cannot find the required contigs in it. Yet I checked the index file and it is where it is supposed to be. Is it possible the issue is related to some job-dependency problem in Snakemake, like a job that uses the index running while the index is still being written or is being deleted?
Unfortunately, those assemblies are sensitive data. I can try to get approval to share them but I am not too hopeful (and it might take a while).
I tried the PAV 2.3.4 Singularity image and it ran smoothly.
In 2.3.4, PAV always re-writes the input FASTA files (it stores a whole copy in the run directory). Later versions of PAV link to the input FASTAs when they can. I think this is the reason it's working in 2.3.4.
I just released 2.4.2. This version adds the parameter "no_link_qry" (`config.json`); set it to "True" to force-rebuild the FASTA like the old version did. I'd recommend using the newer version if you can. If you try this, let me know if it solves the issue.
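If it helps, this is one way to add the key to an existing `config.json` without retyping it (assumes `jq` is available; otherwise just add the entry by hand):

```bash
# Add the new parameter to config.json, leaving all other entries untouched
jq '. + {"no_link_qry": "True"}' config.json > config.json.new && mv config.json.new config.json
```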
PAV 3 is being developed with some major improvements. It sounds like this is a pretty big project, and if I can get a beta version of PAV 3 out quickly enough, would you be interested in trying it?
Thanks. I pulled the 2.4.2 image and, no matter whether I use the `no_link_qry` parameter in `config.json` or not, I get the following error right at the beginning, before anything starts:
Config file config.json is extended by additional config specified via the command line.
Assuming unrestricted shared filesystem usage.
host: XXXXXXX
Building DAG of jobs...
MissingInputException in rule call_integrate_sources in file /opt/pav/rules/call.snakefile, line 492:
Missing input files for rule call_integrate_sources:
output: results/XXXXXXX/bed_hap/pass/h1/svindel_ins.bed.gz, results/XXXXXXX/bed_hap/pass/h1/svindel_del.bed.gz, results/XXXXXXX/bed_hap/pass/h1/sv_inv.bed.gz, results/XXXXXXX/bed_hap/pass/h1/snv_snv.bed.gz, results/XXXXXXX/bed_hap/pass/h1/fa/svindel_ins.fa.gz, results/XXXXXXX/bed_hap/pass/h1/fa/svindel_del.fa.gz, results/XXXXXXX/bed_hap/pass/h1/fa/sv_inv.fa.gz, temp/XXXXXXX/bed_hap/fail/h1/svindel_ins.bed.gz, temp/XXXXXXX/bed_hap/fail/h1/svindel_del.bed.gz, temp/XXXXXXX/bed_hap/fail/h1/sv_inv.bed.gz, temp/XXXXXXX/bed_hap/fail/h1/snv_snv.bed.gz, temp/XXXXXXX/bed_hap/fail/h1/fa/svindel_ins.fa.gz, temp/XXXXXXX/bed_hap/fail/h1/fa/svindel_del.fa.gz, temp/XXXXXXX/bed_hap/fail/h1/fa/sv_inv.fa.gz
wildcards: asm_name=XXXXXXX, hap=h1
affected files:
temp/XXXXXXX/lg_sv/sv_del_h1.bed.gz
temp/XXXXXXX/lg_sv/sv_ins_h1.bed.gz
results/XXXXXXX/align/trim-tig/depth_tig_h1.bed.gz
I would definitely be interested in trying out a PAV 3 beta. We have 300+ assemblies.
Ugh, I built the wrong branch into the containers. Fixing now...
I just pushed 2.4.3. Thanks for reporting these issues.
So I just tried the 2.4.3 version with "no_link_qry" set to "True" in `config.json`. Same errors as usual at the same location :(
It is still trying to read the index files from "temp/XXXXXXX/align/contigs_h1.fa.gz.gzi" or "temp/XXXXXXX/align/contigs_h1.fa.gz.fai" (same for h2) and still fails to do so. The error is always `Error reading temp/XXXXXXX/align/contigs_h1.fa.gz.gzi: No such file or directory` or `Could not understand FASTA index`.
I'm not sure why this is happening. There's a bug somewhere that's only affecting you (so far), and I can't tell where or why. I'll see if I can figure out anything else that might be causing it.
I had an issue with similar symptoms: failing on call_cigar and failing to read the index files. However, for me the `no_link_qry` parameter fixed the issue (version 2.4.6). Thank you for adding it.
Hi,
So I recently pulled the Singularity image of PAV 2.4.2.1, which I have used in parallel on a "large" number of assemblies. All the jobs eventually failed after some time, so I restarted the run on a single assembly, which eventually failed too. For some reason, the errors only got printed in the terminal and are not included in the Snakemake log file, so I am copying them here. I am also enclosing the Snakemake log file in case it is useful. I would appreciate any feedback on this issue.
Thanks, Guillaume
snakemake.log