fmfi-genomika / genomikaMalGlo

Malassezia globosa

(O) Differences between strains (slow, needs A) #15

Open mrshu opened 6 years ago

mrshu commented 6 years ago

In trackDb.ra, include something like this:

composite track:

track track_name
compositeTrack on
type vcfTabix
shortLabel ...
longLabel ...
group varRep
visibility hide

subtrack:

track subtrack_name
shortLabel ...
longLabel ...
parent track_name
visibility pack

mrshu commented 6 years ago

@matuszelenak Any progress here? Let me know if I can be of any help.

matuszelenak commented 6 years ago

I'll get around to it tomorrow morning, hopefully.

matuszelenak commented 6 years ago

I have a problem with identifying chromosomes in the strain sequences. Even though C-Sibelia seems to finish successfully, tabix fails with "chromosome blocks not continuous".

When I look at the original .fna files, I'm confused as to which part of the sequence description denotes the chromosome (if any). For example, in this header:

>LFDC01000055.1 Malassezia globosa strain CBS 7874 MG7874_1230, whole genome shotgun sequence

Any ideas?

matuszelenak commented 6 years ago

It looks like C-Sibelia takes the first part of the description (>LFDC01000055.1) as the chromosome name. I checked the file it generates and, from what I can tell, the blocks for each chromosome are continuous. I'm really clueless now.
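As far as I understand the tabix error, it is raised when records for one sequence name do not form a single contiguous run in the file. A minimal Python sketch of that check, assuming a plain-text, tab-separated VCF; the function name `find_split_contigs` is made up for illustration and is not part of tabix or C-Sibelia:

```python
def find_split_contigs(vcf_lines):
    """Return contig names whose records appear in more than one block.

    Hypothetical helper: vcf_lines is any iterable of VCF text lines.
    """
    seen = set()      # contigs that have already started a block
    split = []        # contigs found in a second, separate block
    previous = None   # contig of the previous record line
    for line in vcf_lines:
        # Skip blank lines and header lines (they start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        contig = line.split("\t", 1)[0]
        if contig != previous:
            # A new block starts here; flag the contig if it had one before.
            if contig in seen and contig not in split:
                split.append(contig)
            seen.add(contig)
            previous = contig
    return split
```

Passing an open file handle for the uncompressed VCF should work, since the function only iterates over lines. An empty result means every contig occupies one contiguous block; it does not check that positions within a block are sorted.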

bbrejova commented 6 years ago

Indeed, LFDC01000055.1 is a contig name, which is our equivalent of a chromosome (we do not have whole chromosomes available). Where is the (VCF?) file that C-Sibelia generated?

matuszelenak commented 6 years ago

So far only in my home folder /home/z/zelenak17/O

bbrejova commented 6 years ago

It worked for me when I first sorted the file by contig name and coordinate: sort -k1,1 -k2,2g variant_CBS7874.vcf > variant_CBS7874.new.vcf
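The same sort can be sketched in Python so that the header lines are kept at the top in their original order instead of relying on how sort happens to place them. This is only an illustration, assuming a plain-text, tab-separated VCF with integer positions in the second column; `sort_vcf_lines` is a hypothetical name:

```python
def sort_vcf_lines(lines):
    """Return VCF lines with the header first and records sorted.

    Hypothetical helper: records are ordered by contig name (column 1),
    then by numeric position (column 2).
    """
    lines = list(lines)
    # Header lines start with '#'; keep them in their original order.
    header = [l for l in lines if l.startswith("#")]
    # Everything else that is not blank is treated as a record.
    records = [l for l in lines if l.strip() and not l.startswith("#")]

    def sort_key(line):
        fields = line.split("\t")
        return (fields[0], int(fields[1]))

    return header + sorted(records, key=sort_key)
```

The returned list can be written out with `writelines` and then compressed and indexed as before.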

matuszelenak commented 6 years ago

So I created a table called strain_difference in the malGlo1 SQL database, containing the filename of the vcf.gz file, and added the corresponding entry to trackDb.ra according to the new part of the manual.

However, I'm not sure the track is displaying correctly. When I bring up the detailed description, it does show sample differences between the strains as text, but I can't see anything graphically in the browser.

bbrejova commented 6 years ago

It seems the whole VCF file was not generated correctly. The malGlo genome in the browser has sequence names of the form NW_001849832.1, while the VCF file has sequence names of the form AAYY01000050.1.
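A mismatch like this can be caught before loading the track by comparing the names in the VCF against the headers of the reference FASTA. A minimal Python sketch, assuming the sequence ID is the first whitespace-delimited token after ">" and that the VCF is plain text; all three function names are hypothetical:

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def vcf_contigs(vcf_lines):
    """Collect the first column of every non-header VCF record."""
    return {
        line.split("\t", 1)[0]
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    }


def unknown_contigs(fasta_lines, vcf_lines):
    """Return VCF contig names that are absent from the FASTA, sorted."""
    return sorted(vcf_contigs(vcf_lines) - fasta_ids(fasta_lines))
```

An empty list means every name used in the VCF also exists in the FASTA; anything returned is a name the browser would not recognize.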

matuszelenak commented 6 years ago

Well...I really don't know how to deal with this. Is there any way to translate between these two naming conventions?

bbrejova commented 6 years ago

It seems to me that the AAYY IDs come from an older assembly (a different FASTA file). See the source GenBank records: https://www.ncbi.nlm.nih.gov/nuccore/AAYY01000050.1/ vs. one of the contigs from the browser: https://www.ncbi.nlm.nih.gov/nuccore/NW_001849855.1/ The correct FASTA file for the genome is on the genomika server in the /gbdb/malGlo1/ directory.

The reference assembly should be the one from the browser; the other one should be another strain.

matuszelenak commented 6 years ago

The file in /gbdb/malGlo1 has the NW names, while the strains have LFDC/LFGF names, so this still solves nothing :(

bbrejova commented 6 years ago

I assume that C-Sibelia accepts two FASTA files and produces a VCF file with IDs from one of them. One of the FASTA files is the one from /gbdb/malGlo1 and the other is another strain downloaded from NCBI. You have to run it so that the IDs in the VCF come from the FASTA file in /gbdb/malGlo1.

However, I have now noticed that there is a mistake in the task above: it tells you to compare the reference strain with CBS 7966 and CBS 7874, but CBS 7966 is actually the reference. So you can either compare it only with CBS 7874, or possibly also download CBS 7990 (optional).

matuszelenak commented 6 years ago

I assumed that we can also compare two assemblies of the same strain...oh well.

Looks like it's finally kinda working. The question now is how to make the track structure work properly. I assume that the parent track is "virtual" in a sense, and the child tracks should represent the two strains.

From the description in the original post, I don't see how the browser is supposed to know where to load the data for each strain's track.

In the browser it shows up as segments of two rows that mark the difference. I assume this only shows the difference of one strain against the reference. How do I make it show both strains?

trackDb.ra config:

track strain_difference
compositeTrack on
visibility hide
type vcfTabix
shortLabel Strains
longLabel Difference against CBS7990 and CBS7874 strain
group varRep
html strain_difference

track CBS7874
shortLabel CBS7874
longLabel CBS7874 strain (L)
parent strain_difference
visibility pack

track CBS7990
shortLabel CBS7990
longLabel CBS7990 strain (L)
parent strain_difference
visibility pack

And the strain_difference table in hgsql contains the paths to the correct files.

matuszelenak commented 6 years ago

I see the other group just gave up on this approach and simply made a table in DB for every strain.

Guess I'll follow their lead :D

bbrejova commented 6 years ago

Several comments:

(1) Yes, it is possible to compare two assemblies of the same strain, but that is perhaps less useful for a general audience than comparing two strains. Either way, the IDs in the VCF must come from the assembly used in the browser, not the other one.

(2) I think you do indeed need a separate table for each strain (as you have with tables CBS7874 and CBS7990), but they can still be joined into a single composite track exactly as in your trackDb.ra snippet above. The parent track strain_difference is not in the database. Currently the browser has two separate tracks. What happens when you use the trackDb with the composite track and parents?