BilkentCompGen / valor

variation discovery using long range information in linked-reads
BSD 3-Clause "New" or "Revised" License
15 stars 2 forks source link

Finding structural varianCommand terminated by signal 11 #11

Open morispi opened 4 years ago

morispi commented 4 years ago

Hello,

I am attempting to run Valor on a non-human dataset for which no default Sonic file is available.

I thus followed the guidelines on Sonic's GitHub page and created my own file. I generated two Sonic files, one with empty bed / out files passed to the --dups, --reps, and --gaps parameters, and another with files containing actual data passed to the parameters. First thing that seemed strange to me is that the Sonic file generated from empty bed/out files has a greater size that the other one. Is that an expected behaviour?

I generated the segmental duplication file using SEDEF, as mentioned here https://github.com/calkan/sonic/issues/11 , but since my assembly is not soft-masked, SEDEF generated an empty file. RepeatMasker ran correctly, and I ran a script of my own to find the gaps. Both gaps and repetitions files were thus correct and contained data.

Anyway, I then attempted to run Valor with these two Sonic files, to see if there would be any difference. However, Valor fails with both files, reporting that "Finding structural varianCommand terminated by signal 11". Both iterations of Valor fail on the same chromosome.

Do you know what might be causing this issue? Is there anything I'm doing wrong with Sonic files? I'm at a loss here.

Thanks in advance.

Best, Pierre

calkan commented 4 years ago

Hi

I will leave the VALOR segmentation fault issue to @f0t1h , but I can take a look at the SONIC related part of your bug report. Larger SONIC file with empty output is definitely not expected. Is it possible for you to share your files so I can try to pinpoint the problem?

morispi commented 4 years ago

Hi,

Thanks for the super quick reply! Here are the files I used to generate the SONIC file: DataSonic.tar.gz

calkan commented 4 years ago

the reference genome is missing :-) can you send the FASTA as well?

f0t1h commented 4 years ago

reporting that "Finding structural varianCommand terminated by signal 11".

My guess is, sonic->chromosome_names, array is null. I will double check tomorrow.

morispi commented 4 years ago

Hi again,

Sorry for the delay, I was checking wether or not I could share the reference genome. Unfortunately it is private and I cannot share it. I'm testing Valor / Sonic on a subset of the reference genome and will see if I can upload this subset.

Best, Pierre

calkan commented 4 years ago

well, I will need something to test. But a quick look at the gaps file shows a problem, it is not a valid BED file. you need something like:

chromosome1 45400 55400 ...

separated by tabs. those "edges=5300..117305 left=156508 right=143120 ver=1.10 style=3" parts have to go

morispi commented 4 years ago

I asked and I have the right to share a subset of my data. Here is an archive containing everything: ExampleDataset.tar.gz

The invalid BED file you mentioned came from the way my reference genome's chromosomes were named. I updated their name and tried to run Sonic / Valor again, but nothing changed, and I ended up with an empty Sonic file, and Valor stopping upon start.

calkan commented 4 years ago

well, at least for this set it looks fine. an you also try with the same subset?

with real files

sonic --ref reference.fasta --gaps Gaps/gaps.bed --reps Repeats/repeats.out --dups SegDups/segdups.bed --make-sonic ref.sonic Number of chromosomes: 1 Adding gap intervals to SONIC. Read 32 BED entries. Writing entries for chromosome 0 Wrote 32 entries. Adding segmental duplication intervals to SONIC. Read 0 BED entries. Writing entries for chromosome 0 Wrote 0 entries. Adding 2297 repeats to SONIC. Read 2294 BED entries. Writing entries for chromosome 0 Wrote 2294 entries. Adding GC profile SONIC. SONIC file ref.sonic is ready. Memory usage: 0.74 MB.

calkan@donut:~/tmp/sonic/DataSonic/ExampleDataset$ ls -l ref.sonic -rw-rw-r-- 1 calkan calkan 54635 Nov 6 16:29 ref.sonic

with null files

calkan@donut:~/tmp/sonic/DataSonic/ExampleDataset$ sonic --ref reference.fasta --gaps null.bed --reps null.bed --dups null.bed --make-sonic ref-null.sonic Number of chromosomes: 1 Adding gap intervals to SONIC. Read 0 BED entries. Writing entries for chromosome 0 Wrote 0 entries. Adding segmental duplication intervals to SONIC. Read 0 BED entries. Writing entries for chromosome 0 Wrote 0 entries. Adding 0 repeats to SONIC. Read 0 BED entries. Writing entries for chromosome 0 Wrote 0 entries. Adding GC profile SONIC. SONIC file ref-null.sonic is ready. Memory usage: 0.00 MB. calkan@donut:~/tmp/sonic/DataSonic/ExampleDataset$ ls -l ref-null.sonic -rw-rw-r-- 1 calkan calkan 26399 Nov 6 16:30 ref-null.sonic

morispi commented 4 years ago

I tried with this subset, yes.

Here is what I get:

./sonic --ref reference.fasta --gaps Gaps/gaps.bed --reps Repeats/repeats.out --dups SegDups/segdups.bed --make-sonic ref.sonic Number of chromosomes: 1 Adding gap intervals to SONIC. Wrote 32 entries. Adding segmental duplication intervals to SONIC. Wrote 0 entries. Adding repeats to SONIC.

ll ref.sonic -rw-rw-r-- 1 morispi morispi 0 nov. 6 14:38 ref.sonic

I tried on two different computers and got the same output.

morispi commented 4 years ago

Apparently, this could be caused by the version of Sonic, which is not the latest by default when cloning Valor. I manually pulled the latest version and managed to properly create the Sonic file. I will attempt to run Valor again, but need to update my BAM file, since my reference genome's headers contained spaces when the BAM file was generated, which seems to be incompatible with Valor.

Thanks a lot for your answers!

calkan commented 4 years ago

oh, I didn't notice it was a versioning issue. I will update the version in this repo then.

calkan commented 4 years ago

I updated the SONIC submodule in this repo. Let us know if you still have the segmentation fault problem. I won't be able to help with VALOR code itself, it will be in @f0t1h 's court if problem persists.

morispi commented 4 years ago

Hi again,

I modified my reference genome so it does not contain spaces, it thus only has simple headers such as ">0", ">1", etc. I re-ran LongRanger to generate a new BAM file of alignment betweens my reads and this new reference genome. I re-generated the Sonic file with the aforementioned new reference genome, and all went well.

I then tried to re-run Valor with these new files, and using 1 threads, it still reports that it ended by signal 11. Here are the last few lines of the log:

Finding structural variants in chromosome 27
Recovering split molecules..
Sorting the DNA Intervals
Recovering molecules
Global molecule depth mean: 132.279412
Global molecule depth standard deviation: 41.180752
Molecule Count: 6   Molecule mean: 18266.833333 Molecule std-dev: 10096.702409
Matching split molecules
2 candidate variations are made
0 candidate variations are left after filtering
Finding SV Clusters

Finding structural variants inCommand terminated by signal 11

    Command being timed: "/home/genouest/genscale/pmorisse/Valor/valor -i /scratch/supergene/mapping_onehap38/longranger_align/25_ShortHeaders/out/25/outs/possorted_bam.bam -o ResultsPapillon25Ref/res -s /home/genouest/genscale/pmorisse/Data/38_ShortHeaders.sonic -f INV,DUP,IDUP,TRA,ITRA,DEL -t 1"

This is not a memory issue, since I reserved 50GB, and Valor barely used 300MB. Is there anything else I am doing wrong?

Thanks in advance. Pierre

calkan commented 4 years ago

leaving this to @f0t1h

f0t1h commented 4 years ago

Can you send me the sonic file, so I can test what is wrong?

f0t1h commented 4 years ago

Also, can you send all logs from the beginning? @morispi

morispi commented 4 years ago

Here is an archive containing the Sonic file and the logs from the run: Issue11Data.tar.gz

I ran again on a single chromosome in order to get a smaller sonic / log. Tell me if you need anything else.

f0t1h commented 4 years ago

Does this error happen always on first chromosome in the sample, or randomly?

morispi commented 4 years ago

Well the data I shared only contains one chromosome, so it always happens on this one.

morispi commented 4 years ago

Apart from that, on a regular file with multiple chromosomes, it seems to happen randomly.

f0t1h commented 4 years ago
#include <stdio.h>
#include <stdlib.h>
#include "sonic/sonic.h"

int main(int argc, char **argv){

    sonic *snc = sonic_load(argv[1]);
    printf("%s\n",snc->chromosome_names[0]);
    printf("%lf\n",sonic_is_segmental_duplication(snc,snc->chromosome_names[0],1000,50000));
    printf("%d\n",sonic_is_gap(snc,snc->chromosome_names[0],1000,50000));
    printf("%lf\n",sonic_is_satellite(snc,snc->chromosome_names[0],1000,50000));
    free_sonic(snc);
    return 0;
}

gcc test.c sonic/libsonic.a -lz -g -o sonic_test

and

valgrind -v ./sonic_test 38_SuperGene_ShortHeaders.sonic

I run above code loading the sonic file you provided and got no errors. Can you try the same? Sonic file looks healthy. Is it possible that there is a corruption in the bam file?

morispi commented 4 years ago

Here is the output from Valgrind: Valgrind.log

I don't think the BAM file is corrupted, since it was generated recently, with LongRanger 2.2.2, and since LongRanger did not report any error while running. Is there any way I can check if the BAM file is corrupted though?

f0t1h commented 4 years ago

You can use samtools quickcheck . If it doesn't fail there you can use samtools view to write it to another bam file (if it is not too large).

I will be offline now since it is past 2AM here. I will follow up tomorrow.

morispi commented 4 years ago

Samtools quickcheck did not report any error. I tried to use samtools view to write my bam file to another bam file, and attempted to run Valor on that new bam file again. It still ended with the same error.

No problem with following up tomorrow, thanks for your answers!

f0t1h commented 4 years ago

Can you set --contig_count or -c to number of chromosomes in your reference? By default, it assumes Human sequencing and runs first 24 (22+ X +Y) chromosomes. It may be the cause of your problem. If this fixes it, I will update the software accordingly.

morispi commented 4 years ago

Sorry for not answering yesterday, it was a holiday in France. Running valor with the option --contig_count seems to resolve the issue! I just realized this option is actually mentioned in the --help of Valor, but I stuck to the README only to help me launch Valor. My bad. However, maybe you should add all options to the README? I think that may help other people as well.

Anyway, thank you very much for your answers!

morispi commented 4 years ago

Edit: Well, it seemed to work at first... It still terminated with signal 11, but much later that previous runs. I tried setting the --contig_count to a higher value but it did not seem to change anything. Moreover, Valor seems to be running extremely fast compared to other experiments on human data I performed. He is the complete log if you want to take a look: Valor.log

f0t1h commented 4 years ago

I am in process of cleaning the command line interface, once done I will update the README as well.

Can you let me know minimum number of bases in these contigs? Valor is designed for large variants, problem might be because of the smaller contigs. I will update the code tomorrow to check this.

morispi commented 4 years ago

Sorry, it seems like I forgot to reply.

The main contig of interest is about 3.9Mbp. I managed to run Valor successfully on this contig only, by subsampling my BAM and FASTA files. However Valor did not output any variant. Is it because the contig is too small?