calkan / sonic

Some Organism's Nucleotide Information Container

SONIC annotation file built using UCSC information failing for C. elegans (not when using Wormbase reference FASTA?) #13

Closed moldach closed 4 years ago

moldach commented 4 years ago

I'm getting different SONIC files for the C. elegans reference genome depending on the source (i.e. UCSC or Wormbase). The SONIC file built entirely from UCSC references fails with the tardis SV caller (which requires the SONIC annotation file).

As there are no duplication/gap annotations for C. elegans, we only need to download the following and touch empty duplication/gap annotation files (see #12):

1 - Download chromFa FASTA files into a folder and merge them into a single .fasta file:

wget http://hgdownload.cse.ucsc.edu/goldenPath/ce11/bigZips/chromFa.tar.gz; tar -zxvf chromFa.tar.gz
cat *.fa > ref.fasta

Create index file in the same folder:

samtools faidx ref.fasta

2 - Download the RepeatMasker .out files into a folder and merge them into a single .out file:

cat */*.fa.out >reps.out

3 & 4 - Touch empty files for the missing annotations:

touch gaps.bed
touch dups.bed

When you merge using the UCSC reference:

sonic --ref ref.fasta --dups dups.bed --reps reps.out --gaps gaps.bed --make-sonic UCSC.sonic --info "ce11-UCSC"

You get a size of 1.7M

If you download the reference genome file from Wormbase instead and then create the SONIC file, you get a SONIC file of only 665K.

wget ftp://ftp.wormbase.org/pub/wormbase/releases/WS274/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS274.genomic.fa.gz
gzip -d c_elegans.PRJNA13758.WS274.genomic.fa.gz
samtools faidx c_elegans.PRJNA13758.WS274.genomic.fa

sonic --ref c_elegans.PRJNA13758.WS274.genomic.fa --dups dups.bed --reps reps.out --gaps gaps.bed --make-sonic wormbase.sonic --info "ce11-wormbase"

As these SONIC files are binary, I'm not sure how to investigate the differences between them. However, using the SONIC file built from the UCSC FASTA files results in an error when running tardis.
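One way to look for differences without decoding the binary files is to compare the sequence names in the two `.fai` indexes that `samtools faidx` already produced. The sketch below is only an illustration: `diff_contig_names` is a hypothetical helper name, and it assumes both indexes exist and that column 1 of each holds the sequence name.

```shell
# Hypothetical helper: print sequence names that appear in only one of two
# samtools .fai indexes. Column 1 of a .fai file is the sequence name.
diff_contig_names() {
    a=$(mktemp)
    b=$(mktemp)
    cut -f1 "$1" | LC_ALL=C sort > "$a"
    cut -f1 "$2" | LC_ALL=C sort > "$b"
    # comm -3 suppresses names common to both files:
    # unindented lines are only in the first index,
    # tab-indented lines are only in the second index.
    LC_ALL=C comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

For the files in this issue it could be called as `diff_contig_names ref.fasta.fai c_elegans.PRJNA13758.WS274.genomic.fa.fai`. Empty output would mean the two references use identical sequence names.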

tardis-nocram -i file.bam --ref c_elegans.PRJNA13758.WS265.genomic.fa \
        --sonic UCSC.sonic --out wormbase-c11

Reference FASTA and SONIC file do not match. Check if you are using the same version.

Using the SONIC file generated from the Wormbase reference does not cause the error:

tardis-nocram -i file.bam --ref c_elegans.PRJNA13758.WS265.genomic.fa \
        --sonic wormbase.sonic --out wormbase-c11

Have you seen this before? I should probably also raise this with the folks at UCSC and with the developer of tardis, but I'm opening the issue here and will direct them to it.

calkan commented 4 years ago

Hi

This is a "classic" instance of the GRC vs. UCSC naming problem known from the human assemblies. The UCSC assembly you downloaded has chromosome names chrI, chrII, etc.; the other one names the chromosomes I, II, etc. I am guessing you are using the same RepeatMasker annotations in both cases, which causes a mismatch between chromosome names. Since the RepeatMasker annotation will have chrI and the FASTA will have I, they won't match and the repeats will be discarded. The same mismatch will cause TARDIS to skip annotations since, as far as it is concerned, there are no repeats in "I", which changes the output. The same issue will arise if the chromosome names in the BAM file don't match the chromosome names in the FASTA and/or RepeatMasker files.
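A quick way to check whether this is happening is to count how many RepeatMasker records refer to a sequence name that is missing from the FASTA index. This is a minimal sketch, not something SONIC provides: `count_repeat_matches` is a hypothetical helper, and it assumes the standard RepeatMasker `.out` layout in which data lines start with a numeric score and column 5 is the sequence name.

```shell
# Hypothetical helper: $1 = samtools .fai index, $2 = RepeatMasker .out file.
# Reports how many repeat records name a sequence that is (or is not)
# present in the FASTA index.
count_repeat_matches() {
    awk '
        NR == FNR { known[$1] = 1; next }   # first file: names from the .fai
        $1 ~ /^[0-9]+$/ {                   # data lines start with a numeric score
            if ($5 in known) matched++
            else unmatched++
        }
        END { printf "matched=%d unmatched=%d\n", matched, unmatched }
    ' "$1" "$2"
}
```

If every record lands in `unmatched`, the repeat annotations and the FASTA are using different naming schemes. The names in the BAM header can be inspected the same way with `samtools view -H file.bam`.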

I wish there were an easy way to seamlessly fix this issue within SONIC, but it is harder than it looks, especially for human, since there are also alternative haplotypes that are named very differently. The easiest way for you, if the assembly composition is indeed the same, is to just rename one of your sources. You can remove "chr" from the RepeatMasker annotations, for example, or add it to the FASTA.
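As a rough sketch of that renaming, both directions can be done with `sed`. The function names are hypothetical, and the first one assumes the standard RepeatMasker `.out` layout (four numeric columns followed by the sequence name). Note that a plain prefix change may not reconcile every sequence: the mitochondrial sequence, for instance, may be named differently in the two sources and would then need its own rule.

```shell
# Hypothetical helpers; both write the renamed file to standard output.

# Remove a leading "chr" from the sequence name (column 5) of RepeatMasker
# .out data lines. Header lines do not match the pattern and pass through.
strip_chr_from_rmsk() {
    sed -E 's/^([[:space:]]*[0-9]+([[:space:]]+[0-9.]+){3}[[:space:]]+)chr/\1/' "$1"
}

# Alternative direction: add a "chr" prefix to every FASTA header line.
add_chr_to_fasta() {
    sed 's/^>/>chr/' "$1"
}
```

For example, `strip_chr_from_rmsk reps.out > reps.nochr.out` would give a RepeatMasker file to pass to `--reps` together with the Wormbase FASTA. Whichever source is renamed, the result needs to agree with the names in the BAM file as well.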

moldach commented 4 years ago

Thank you for the prompt response and for clearing this up.