eldariont / svim

Structural Variant Identification Method using Long Reads
GNU General Public License v3.0

Suggestion: Add BEDPE as an optional output #23

Closed Dfupa closed 5 years ago

Dfupa commented 5 years ago

Hi!

I am currently trying to integrate SVIM into a pipeline that requires BEDPE files as input for misassembly detection. At the moment I can manage by converting the VCF to BEDPE, but I wonder if you might consider adding an option in the future to output a BEDPE file directly from SVIM.

eldariont commented 5 years ago

Hi Diego,

thanks for your question. Are you interested in a particular SV class? SVs are very heterogeneous, and some classes, such as deletions, insertions and inversions, are characterized by only one region. That's why SVIM already outputs them in BED format (https://github.com/eldariont/svim/wiki#candidatescandidates_bed). Other classes, particularly duplications and translocations, are characterized by two regions, which can even be on different chromosomes. For those classes, BEDPE output would make sense.

Maybe you could explain a bit more what precise information you need from SVIM for the misassembly detection so that I can understand better?

Cheers David

Dfupa commented 5 years ago

Hi David,

I intend to use the SV data produced by SVIM to differentiate real misassemblies from fake ones caused by structural differences between the reference sequence and my sequenced organism. For the current iteration of the pipeline I am mostly interested in the first seven columns of the BEDPE format, similarly to how QUAST (one of the tools used in the pipeline) uses them. chrom1, start1 and end1 define a confidence interval around the SV start, and chrom2, start2 and end2 define a confidence interval around the SV end. The name column is used to assign the SV category by checking whether it contains the substring INV or DEL. Duplications and translocations are identified automatically if chrom1 differs from chrom2. We are still working on handling insertions.

Example of a BEDPE line for a deletion: h.sapiens.chr11 10012 10130 h.sapiens.chr11 10086 10205 DELETION

Example of a BEDPE line for a translocation: h.sapiens.chr11 10012 10130 h.sapiens.chr18 45687 45805 Name_not_necessary
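
The classification rules described above can be sketched as a small shell function. This is only an illustration of the rules as stated in this comment, not QUAST's actual code: the function name `classify_bedpe`, the category labels, and the order in which the rules are checked (chromosome mismatch first, then INV, then DEL) are all assumptions. It also assumes tab-delimited input with the name in column 7.

```shell
# classify_bedpe (hypothetical helper): read tab-delimited BEDPE on stdin
# and append an assumed category label as an extra last column.
classify_bedpe() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    if ($1 != $4)         category = "INTERCHROMOSOMAL"  # chrom1 differs from chrom2
    else if ($7 ~ /INV/)  category = "INVERSION"         # name contains INV
    else if ($7 ~ /DEL/)  category = "DELETION"          # name contains DEL
    else                  category = "UNCLASSIFIED"      # e.g. insertions, not handled yet
    print $0, category
  }'
}
```

It could be run as `classify_bedpe < deletions.bedpe`, where the file name is just a placeholder for any seven-column BEDPE file.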

Thank you for your quick reply! Diego

eldariont commented 5 years ago

Hi Diego,

thanks for explaining in more detail. At the moment, SVIM does not output confidence intervals around SV breakpoints but only the most likely breakpoint from a given set of read alignments. That's why SV candidates are written in BED and VCF format (one coordinate per breakpoint) but not BEDPE (two coordinates per breakpoint). I'm not sure whether direct BEDPE output by SVIM would be of general interest to other users as well.

But as you already mentioned, you can convert VCF/BED to BEDPE like this:

bcftools query -i 'SVTYPE=="DEL"' -f '%CHROM\t%POS\t%POS\t%CHROM\t%END\t%END\t%ID\n' final_results.vcf > deletions.bedpe
bcftools query -i 'SVTYPE=="INV"' -f '%CHROM\t%POS\t%POS\t%CHROM\t%END\t%END\t%ID\n' final_results.vcf > inversions.bedpe
awk 'BEGIN {OFS="\t"} {split($4, f, ";"); split(substr(f[2], 2), g, ":"); print $1, $2, $2, g[1], g[2], g[2], "TRANS_"NR}' candidates/candidates_breakends.bed > translocations.bedpe
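
Because SVIM reports a single most likely position per breakpoint, the records written by commands like these have start equal to end. If a downstream tool expects an interval around each breakpoint, one option is to widen every position by a fixed margin. The following is a minimal sketch under that assumption: the function name `pad_bedpe` and the default margin of 50 bases are hypothetical, and the resulting interval is an arbitrary padding, not a confidence interval estimated by SVIM.

```shell
# pad_bedpe (hypothetical helper): widen both breakpoint intervals of a
# tab-delimited BEDPE stream by a fixed margin (first argument, default 50).
# The margin is an arbitrary illustrative value, not derived from SVIM.
pad_bedpe() {
  awk -v pad="${1:-50}" 'BEGIN { FS = OFS = "\t" }
  {
    $2 = ($2 - pad < 0) ? 0 : $2 - pad   # start1, clamped at 0
    $3 = $3 + pad                        # end1 (not clamped to chromosome length)
    $5 = ($5 - pad < 0) ? 0 : $5 - pad   # start2, clamped at 0
    $6 = $6 + pad                        # end2 (not clamped to chromosome length)
    print
  }'
}
```

For example, `pad_bedpe 100 < deletions.bedpe > deletions.padded.bedpe` would pad each side by 100 bases. Ends are not clamped to chromosome lengths, which would need a separate file of chromosome sizes.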

Does this work for you to convert the current SVIM output or is there another reason for SVIM to directly support BEDPE output?

Cheers David

Dfupa commented 5 years ago

Hi David,

Yes, I've already been converting VCF to BEDPE with a similar command. However, I wondered whether it might be worth adding to SVIM's output, though it is true that it might end up being a niche option used for only a few tasks. Other than that, I don't have any complaints.

Thank you, regardless!