cytham / nanovar

Structural variant caller for low-depth long-read sequencing data
GNU General Public License v3.0
45 stars 10 forks source link

bedtools getfasta error #89

Open C-YONG opened 3 months ago

C-YONG commented 3 months ago

Hello, software developer. I have three ONT datasets in total. The first two ran normally, but the third one encountered the following error. What could be the reason for this?

[24/07/2024 15:27:11] - NanoVar started [24/07/2024 15:27:11] - Checking integrity of input files - Pass [24/07/2024 15:27:12] - Analyzing read alignments and detecting SVs - Done [24/07/2024 15:52:39] - Clustering SV breakends - Done [24/07/2024 16:08:29] - Correcting DUP and detecting TE - Done [24/07/2024 16:13:00] - Generating VCF files and report - Traceback (most recent call last): File "/home/chiyong/miniconda3/envs/nanovar/bin/nanovar", line 635, in main() File "/home/chiyong/miniconda3/envs/nanovar/bin/nanovar", line 486, in main run.vcf_report() # ^^^^^^^^^^^^^^^^ File "/home/chiyong/miniconda3/envs/nanovar/lib/python3.11/site-packages/nanovar/nv_characterize.py", line 205, in vcf_report alt_seq = get_alt_seq(self.dir, self.out_nn, self.refpath) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/chiyong/miniconda3/envs/nanovar/lib/python3.11/site-packages/nanovar/nv_alt_seq.py", line 46, in get_alt_seq fasta = bed.sequence(fi=ref_path, nameOnly=True) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/chiyong/miniconda3/envs/nanovar/lib/python3.11/site-packages/pybedtools/bedtool.py", line 907, in decorated result = method(self, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/chiyong/miniconda3/envs/nanovar/lib/python3.11/site-packages/pybedtools/bedtool.py", line 388, in wrapped stream = call_bedtools( ^^^^^^^^^^^^^^ File "/home/chiyong/miniconda3/envs/nanovar/lib/python3.11/site-packages/pybedtools/helpers.py", line 456, in call_bedtools raise BEDToolsError(subprocess.list2cmdline(cmds), stderr) pybedtools.helpers.BEDToolsError: Command was:

    bedtools getfasta -nameOnly -fo /tmp/pybedtools.fvq2ujze.tmp -fi 03_genome/genome_NV/genome1.3.fna -bed /tmp/pybedtools.ml_5c4kf.tmp

Error message was: Feature (LR584250.1:14620732-14620733) beyond the length of LR584250.1 size (12913240 bp). Skipping.

Writing to /tmp/bcftools.VhKjGh

LaurieLecomte commented 3 months ago

Hi, I encountered the same issue using version 1.7.0 with mapped PacBio ccs reads. I was able to run a previous version of NanoVar (1.3.8) on the same data without any problem.

I use the following command, which throws the 'feature beyond the length' message at the VCF generation step: nanovar $BAM $GENOME_NV $CALLS_DIR/nanovar/$SAMPLE -x pacbio-ccs -t $CPU -f $EXCL_BED --debug

The last files outputted are cluster.tsv and parse2.tsv (empty).

Installation was done with conda as described in the README.

Thank you for any guidance or advice.

cytham commented 3 months ago

Hi @C-YONG @LaurieLecomte, thanks for reporting this. I'll need some time to check on this and will revert in the next few days. Thank you for your patience.

cytham commented 3 months ago

Hi @C-YONG and @LaurieLecomte, please try the v1.7.1-dev branch and let me know if it fixes the issue.

You can clone the branch by: git clone -b v1.7.1-dev https://github.com/cytham/nanovar.git

LaurieLecomte commented 3 months ago

Hi @cytham, thank you very much for the quick response and the update.

Unfortunately, I got the exact same error and outputs using the v1.7.1-dev version.

I initially suspected that the bug might come from Inter-Ins(2) and InterTx, but after investigating a bit on my end, it looks like some calls of Inv(2) and Intra-Ins(2) also give the same error.

I made a custom version of the make_bed() function from v1.7.0 to output an intermediary bed file featuring SV type and the original id~chrom:start-end string from parse1.tsv, and below I pasted a few of the lines representing sites that threw the beyond chromosome length error. For Intra-Ins(2) and Inv(2) sites, in the id~chrom:start-end strings, the start position is greater than the end position:

CM055694.1      56110901        56110902        EC435   Inter-Ins(2)    EC435~CM055694.1:56110902~CM055721.1:17888839
CM055703.1      67666256        67666257        B895F   Intra-Ins(2)    B895F~CM055703.1:67666257-49693072
CM055703.1      67666256        67666257        B895F   Intra-Ins(2)    B895F~CM055703.1:67666257-49693072
CM055703.1      67666256        67666257        D6D6D   Intra-Ins(2)    D6D6D~CM055703.1:67666257-49693072
CM055703.1      67666256        67666257        D6D6D   Intra-Ins(2)    D6D6D~CM055703.1:67666257-49693072
CM055704.1      52471483        52471484        E5B58   Inv(2)  E5B58~CM055704.1:52471484-3246722
CM055705.1      46612017        46612018        D22C1   InterTx D22C1~CM055705.1:46612018~CM055722.1:15467696
CM055707.1      43435500        43435501        E9F58   Intra-Ins(2)    E9F58~CM055707.1:43435501-24087228
CM055708.1      43435500        43435501        C66E2   InterTx C66E2~CM055708.1:43435501~CM055707.1:24087228
CM055711.1      62885703        62885704        71010   Inv(2)  71010~CM055711.1:62885704-42442184
CM055711.1      81492618        81492619        F0477   Inv(2)  F0477~CM055711.1:81492619-4879966

These start positions are also larger than the corresponding chromosome length, hence the error encountered.

I got the same result with make_bed() from v1.7.1, which gave the same output without Inter-Ins and InterTx (as expected).

Hope this helps.

cytham commented 3 months ago

Hi @LaurieLecomte, thanks for looking into this. Indeed, Intra-Ins and Inv could be causing the problem, my bad for missing that out. I have removed them in the latest v1.7.1-dev branch. Could you please try again?

The fact that start positions are greater than end positions in some cases shouldn't affect the BED file creation. So I am thinking that it could be another bug somewhere for Intra-Ins, Inv, etc. But thanks a lot for looking into this, you have been of great help.

Also, could you let me know the chromosome lengths of those you showed? (e.g. CM055703.1, CM055704.1, CM055705.1, CM055707.1, CM055708.1, CM055711.1)

LaurieLecomte commented 3 months ago

Hi @cytham, thanks for the update.

I tried the latest branch and everything ran without any issue, yielding complete VCFs and no errors.

Here are the chromosome lengths for the few problematic sites I shared in my previous comment (from the ASM2944872v1 assembly):

CM055694.1      60416927
CM055703.1      52113258
CM055704.1      42625803
CM055705.1      50983633
CM055707.1      43302116
CM055708.1      48152825
CM055711.1      46201713
cytham commented 3 months ago

Thanks @LaurieLecomte. I'm trying to figure out the root cause for this. Is it possible to share the debug files with me? Specifically, subdata.tsv, detect.tsv, and parse1.tsv? Thanks

LaurieLecomte commented 3 months ago

Hi @cytham, thank you for looking into this. I uploaded the files from a run with version 1.7.0 to Dropbox along with corresponding log files. Let me know if you also need input files (mapped reads and/or reference genome).

Flooooooooooooower commented 3 months ago

Hello, unfortunately, I encountered a "getfasta error" when using both versions 1.7.1 and 1.7.0 of nanovar with HG002 HiFi data. Do you have any solutions for this issue?

nanovar -x pacbio-ccs -t 30 -f ~/01.Reference/Human_GRCh37/nanovar_gap.bed  ./HG002.HIFI.sort.bam   ~/01.Reference/Human_GRCh37/02.ngmlr/GCF_000001405.25_GRCh37.p13_genomic.fa Nanovar_wd
cytham commented 3 months ago

hi @Flooooooooooooower, can you provide some logs please?

Flooooooooooooower commented 3 months ago

hi @Flooooooooooooower, can you provide some logs please?

Hello, thank you for your prompt response. Below is my log file.

 nanovar  --debug -l 50  -x pacbio-ccs -t 30 -f ~/01.Reference/Human_GRCh37/nanovar_gap.bed  ./HG002.HIFI.sort.bam   ~/01.Reference/Human_GRCh37/02.ngmlr/GCF_000001405.25_GRCh37.p13_genomic.fa Nanovar_wd
[09/08/2024 13:12:21] - NanoVar started
[09/08/2024 13:12:21] - Checking integrity of input files - Pass
[09/08/2024 13:13:03] - Analyzing read alignments and detecting SVs - Done
[09/08/2024 13:48:55] - Clustering SV breakends - Done
[09/08/2024 14:42:37] - Correcting DUP and detecting TE - Done
[09/08/2024 15:26:23] - Generating VCF files and report - Traceback (most recent call last):
  File "/home/xxx/miniconda3/envs/myenv/bin/nanovar", line 635, in <module>
    main()
  File "/home/xxx/miniconda3/envs/myenv/bin/nanovar", line 486, in main
    run.vcf_report() #
    ^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/myenv/lib/python3.11/site-packages/nanovar/nv_characterize.py", line 205, in vcf_report
    alt_seq = get_alt_seq(self.dir, self.out_nn, self.refpath)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/myenv/lib/python3.11/site-packages/nanovar/nv_alt_seq.py", line 46, in get_alt_seq
    fasta = bed.sequence(fi=ref_path, nameOnly=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/myenv/lib/python3.11/site-packages/pybedtools/bedtool.py", line 907, in decorated
    result = method(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/myenv/lib/python3.11/site-packages/pybedtools/bedtool.py", line 388, in wrapped
    stream = call_bedtools(
             ^^^^^^^^^^^^^^
  File "/home/xxx/miniconda3/envs/myenv/lib/python3.11/site-packages/pybedtools/helpers.py", line 456, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError:
Command was:

        bedtools getfasta -nameOnly -fo /tmp/pybedtools.xjnohstx.tmp -fi /home/xxx/01.Reference/Human_GRCh37/02.ngmlr/GCF_000001405.25_GRCh37.p13_genomic.fa -bed /tmp/pybedtools.l0vhhi9x.tmp

Error message was:
[NanoVar-090824-1312.log](https://github.com/user-attachments/files/16557678/NanoVar-090824-1312.log)

Feature (19:121485432-121485433) beyond the length of 19 size (59128983 bp).  Skipping.

NanoVar-090824-1312.log

cytham commented 3 months ago

@Flooooooooooooower I see your logs are from v1.7.0, can I please confirm that you get the same error for v1.7.1? thanks

Flooooooooooooower commented 3 months ago

@Flooooooooooooower I see your logs are from v1.7.0, can I please confirm that you get the same error for v1.7.1? thanks

Hi,Below is the log file of 1.7.1. [11/08/2024 16:27:02] - NanoVar-v1.7.1 started [11/08/2024 16:27:02] - Checking integrity of input files - Pass [11/08/2024 16:27:23] - Analyzing read alignments and detecting SVs - Done [11/08/2024 17:49:41] - Clustering SV breakends - Done [11/08/2024 18:35:33] - Correcting DUP and detecting TE - Done [11/08/2024 18:55:58] - Generating VCF files and report - Traceback (most recent call last): File "/opt2/conda/bin/nanovar", line 638, in main() File "/opt2/conda/bin/nanovar", line 489, in main run.vcf_report() # File "/opt2/conda/lib/python3.10/site-packages/nanovar/nv_characterize.py", line 205, in vcf_report alt_seq = get_alt_seq(self.dir, self.out_nn, self.refpath) File "/opt2/conda/lib/python3.10/site-packages/nanovar/nv_alt_seq.py", line 46, in get_alt_seq fasta = bed.sequence(fi=ref_path, nameOnly=True) File "/opt2/conda/lib/python3.10/site-packages/pybedtools/bedtool.py", line 907, in decorated result = method(self, *args, **kwargs) File "/opt2/conda/lib/python3.10/site-packages/pybedtools/bedtool.py", line 388, in wrapped stream = call_bedtools( File "/opt2/conda/lib/python3.10/site-packages/pybedtools/helpers.py", line 456, in call_bedtools raise BEDToolsError(subprocess.list2cmdline(cmds), stderr) pybedtools.helpers.BEDToolsError: Command was:

    bedtools getfasta -nameOnly -fo /tmp/pybedtools.1886eg8o.tmp -fi /home/xxx/01.Reference/Human_GRCh37/02.ngmlr/GCF_000001405.25_GRCh37.p13_genomic.fa -bed /tmp/pybedtools.74ascn07.tmp

Error message was: Feature (NT_167214.1:79186948-79186949) beyond the length of NT_167214.1 size (161802 bp). Skipping. Feature (NT_167214.1:31149863-31149864) beyond the length of NT_167214.1 size (161802 bp). Skipping. Feature (NT_167214.1:76807186-76807187) beyond the length of NT_167214.1 size (161802 bp). Skipping. Feature (NT_167214.1:76807695-76807696) beyond the length of NT_167214.1 size (161802 bp). Skipping. Feature (NT_167214.1:127650692-127650693) beyond the length of NT_167214.1 size (161802 bp). Skipping.

NanoVar-110824-1627.log

jiadong324 commented 1 month ago

I also got the same error while using v1.7.0. image

@cytham Is this issue resolved?

cytham commented 1 month ago

Hi @jiadong324, do you still get the error with the v1.7.1-dev branch?

jiadong324 commented 1 month ago

I just finished the test with v1.7.1-dev. No more errors! When will you release the v1.7.1.

cytham commented 1 month ago

NanoVar v1.8.1 is now available from PyPI and Conda, thanks for your understanding