PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0

Biologically Annotate Insertions #12

Closed DarioS closed 5 years ago

DarioS commented 5 years ago

I wonder whether annotations such as LINE1 retrotransposon / Alu retrotransposon / SVA retrotransposon / HPV virus might be feasible or desirable for the inserted sequences. I've found that, for deeply sequenced samples, MELT runs for longer than the HPC allows, so I'm looking for an alternative way to annotate non-human sequences inserted into the human genome.

d-cameron commented 5 years ago

This capability is provided by the gridss.AnnotateUntemplatedSequence utility. It supports any VCF in BND notation (although I've only tested it with GRIDSS output). We've successfully used it on a 4000 WGS cancer cohort to detect viral insertions and to classify single breakends with repeat annotation (most single breakends are SVs into L1 or centromeric repeats).
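To make "BND notation" and "single breakend" concrete, here is a minimal sketch that classifies a VCF ALT allele as a paired breakpoint or a single breakend, following the ALT forms in the VCF specification. The function names classify_alt and inserted_bases are hypothetical and are not part of GRIDSS or StructuralVariantAnnotation; the sketch also assumes a single anchor base next to any inserted sequence.

```python
import re

# Breakpoint ALT forms from the VCF specification: t[p[  t]p]  ]p]t  [p[t
_BREAKPOINT = re.compile(
    r"^(?P<lead>[ACGTNacgtn]*)(?P<b1>[\[\]])(?P<mate>[^\[\]]+)"
    r"(?P<b2>[\[\]])(?P<trail>[ACGTNacgtn]*)$"
)


def classify_alt(alt):
    """Return 'breakpoint', 'single_breakend', or 'other' for one ALT allele."""
    m = _BREAKPOINT.match(alt)
    # Both brackets must point the same way, and bases sit on exactly one side.
    if m and m.group("b1") == m.group("b2") and bool(m.group("lead")) != bool(m.group("trail")):
        return "breakpoint"
    # Single breakend: bases with a dot on exactly one end, e.g. "G." or ".TTG".
    if alt.startswith(".") != alt.endswith("."):
        if re.fullmatch(r"[ACGTNacgtn]+", alt.strip(".")):
            return "single_breakend"
    return "other"


def inserted_bases(alt):
    """Bases of a single breakend ALT without the anchor base (assumed to be one base)."""
    if classify_alt(alt) != "single_breakend":
        return None
    # Slicing drops the dot on one end and the one-base anchor on the other.
    return alt[1:-1]
```

For example, classify_alt("GACGT.") reports a single breakend, and inserted_bases("GACGT.") gives the sequence that an annotation step would try to place against a viral or repeat reference.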

The single breakend repeat annotation script is available at https://github.com/hartwigmedical/scripts/blob/master/gridss/gridss_annotate_insertions_repeatmaster.R. Note that you'll need to run gridss.AnnotateUntemplatedSequence against the reference genome for this script to work.

We should have a pre-print on our pipeline (including GRIDSS2 features) out in about a month.

DarioS commented 5 years ago

I am interested in trying it on the cancer data set I have access to. I was reading the code of the R script for annotating insertions and I saw a table of repeats defined for the variable repeat_notes. But, it's not used anywhere else in the script, so seems to not be relevant. I also see a FASTA file from the RepeatMasker project is used. Is that the source of the retrotransposon annotations? Will there be an example of how to search for viruses in the preprint?

When I execute java -cp $GRIDSS_JAR gridss.AnnotateUntemplatedSequence -H, I don't see REFERENCE_SEQUENCE documented, but I get an error if it's not specified. I also thought that INPUT and OUTPUT could be the same file, but this erases the input file's contents and produces a cryptic error message:

Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line

If annotating a VCF in place is intentionally unsupported, could the user parameters be validated so that the program stops with an error before the variant file is erased? In-place annotation would be desirable to support, though.
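Until the tool validates its arguments itself, a small wrapper can guard against this. The sketch below refuses to run when INPUT and OUTPUT resolve to the same file, and emulates in-place annotation by writing to a temporary file that replaces the original only if the run succeeds. The function names are hypothetical, and the KEY=value argument syntax is an assumption based on the INPUT, OUTPUT and REFERENCE_SEQUENCE parameter names mentioned above; check the tool's own help output before relying on it.

```python
import os
import subprocess
import tempfile


def build_command(gridss_jar, input_vcf, output_vcf, reference):
    """Build the argument list, refusing identical INPUT and OUTPUT paths."""
    if os.path.realpath(input_vcf) == os.path.realpath(output_vcf):
        raise ValueError("INPUT and OUTPUT must be different files")
    return [
        "java", "-cp", gridss_jar, "gridss.AnnotateUntemplatedSequence",
        f"INPUT={input_vcf}",            # assumed KEY=value syntax
        f"OUTPUT={output_vcf}",
        f"REFERENCE_SEQUENCE={reference}",
    ]


def annotate_in_place(gridss_jar, vcf, reference):
    """Annotate into a temporary file, then replace the original only on success."""
    directory = os.path.dirname(os.path.abspath(vcf))
    fd, tmp = tempfile.mkstemp(suffix=".vcf", dir=directory)
    os.close(fd)
    try:
        subprocess.run(build_command(gridss_jar, vcf, tmp, reference), check=True)
        os.replace(tmp, vcf)  # atomic rename within the same directory
    finally:
        # After a successful replace the temporary path no longer exists.
        if os.path.exists(tmp):
            os.remove(tmp)
```

Creating the temporary file in the same directory as the VCF keeps the final rename on one filesystem, so the original is never left half-written.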

d-cameron commented 5 years ago

> I was reading the code of the R script for annotating insertions and I saw a table of repeats defined for the variable repeat_notes. But, it's not used anywhere else in the script, so seems to not be relevant.

You are correct that it's not used. That field is not included in the output as it was based on repeat classifications that were useful to me, but it is not part of the repeatmasker annotations.

d-cameron commented 5 years ago

> I also see a FASTA file from the RepeatMasker project is used. Is that the source of the retrotransposon annotations?

It is the repeatmasker .fa.out file. The pre-built ones can be downloaded from http://www.repeatmasker.org/species/hg.html
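For anyone who wants to inspect those annotations outside the R script, here is a minimal sketch of a reader for the whitespace-delimited RepeatMasker .out table. It assumes the standard column layout (score, divergence, deletions, insertions, query sequence, begin, end, left, strand, repeat name, class/family, then the repeat coordinates and ID). It is not how the script above reads the file, and the names Repeat and parse_repeatmasker_out are hypothetical.

```python
from collections import namedtuple

Repeat = namedtuple("Repeat", "query start end strand name repeat_class")


def parse_repeatmasker_out(lines):
    """Parse RepeatMasker .out lines, skipping the header and blank lines."""
    records = []
    for line in lines:
        fields = line.split()
        # Data rows start with an integer Smith-Waterman score; headers do not.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        records.append(Repeat(
            query=fields[4],
            start=int(fields[5]),
            end=int(fields[6]),
            strand="-" if fields[8] == "C" else "+",  # "C" marks the complement strand
            name=fields[9],
            repeat_class=fields[10],                  # e.g. "LINE/L1" or "SINE/Alu"
        ))
    return records


# Illustrative rows only; the values are made up for this example.
example = [
    "   SW  perc perc perc  query      position in query           matching       repeat              position in  repeat",
    "score  div. del. ins.  sequence    begin     end    (left)    repeat         class/family         begin  end (left)   ID",
    "",
    "  463   1.3  0.6  1.7  chr1        10001   10468 (248945954) +  (TAACCC)n      Simple_repeat            1  471    (0)      1",
    " 3612  11.4 21.5  1.3  chr1        11505   11675 (248944747) C  L1MC5a         LINE/L1               (2382) 395    199      3",
]
line1_hits = [r for r in parse_repeatmasker_out(example) if r.repeat_class == "LINE/L1"]
```

Filtering on repeat_class in this way is one simple route to the LINE1 / Alu / SVA labels asked about at the top of the thread.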