(B) Protein coding genes and other items from the annotation (fast, needs A)

mrshu commented 6 years ago

Download the genome annotation in GFF format, convert it to genePred format, and split it into two tracks: genes and other items
Last year this was done by two groups using two different databases:
https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-NCBI-RefSeq
https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-EnsembleFungi
Coordinate with renaming chromosomes in step (1)
In the first pass, use last year's scripts to convert formats, then load the tracks. Later we will work on polishing details, e.g.:
- Use appropriate IDs for naming genes
- mRNA items in the other-items track are redundant and should be omitted
- Items covering an entire chromosome (type=region) should also be omitted
- Protein-coding genes could be displayed with codon highlighting; use the following settings in trackDb.ra:

baseColorUseCds given
baseColorDefault genomicCodons
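
The split and the two omissions listed in the bullets above could be sketched roughly as follows in Python. This is only an illustration, not last year's scripts: the function name split_gff is hypothetical, and the set of feature types routed to the gene track is an assumption that may need adjusting for the actual annotation file.

```python
# Feature types assumed to belong to the protein-coding gene track (an assumption).
GENE_TRACK_TYPES = {"gene", "mRNA", "exon", "CDS"}
# Feature types dropped entirely, e.g. items covering a whole chromosome.
DROPPED_TYPES = {"region"}


def split_gff(lines):
    """Split GFF3 lines into (gene_lines, other_lines).

    mRNA items go to the gene track, so they never reach the other-items
    track; type=region items are discarded.
    """
    gene_lines, other_lines = [], []
    for line in lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line
        feature_type = fields[2]
        if feature_type in DROPPED_TYPES:
            continue
        if feature_type in GENE_TRACK_TYPES:
            gene_lines.append(line)
        else:
            other_lines.append(line)
    return gene_lines, other_lines
```

The two returned lists can then be written to separate files and converted to genePred with whichever conversion scripts are in use.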

mrshu commented 6 years ago

@vidriduch There is a rather cryptic bullet point in the description of this task:

Coordinate with renaming chromosomes in step (1)

Can you provide some details on how the chromosomes were renamed in #1? It seems to be fairly important here.

Thanks!

vidriduch commented 6 years ago

@mrshu I believe that when we downloaded the FASTA file, the headers were in the normal FASTA format ><id> <description> (note the space in between). We just removed the description part. This was done using some browser command: we converted FASTA to binary with something like fasta2bin and then from this back to FASTA with bin2fasta, I believe. This removed everything after the space (the description part); otherwise the IDs were left untouched. This process was suggested by the teachers. The chromosome names (IDs) are visible in the browser or in the FASTA file.
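
For reference, the effect described above (keeping only the ID in each FASTA header) can be reproduced with a few lines of Python. This is a minimal sketch of the same result, not the browser commands that were actually used; the function name strip_fasta_descriptions is hypothetical.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with everything after the ID removed from headers."""
    for line in lines:
        if line.startswith(">"):
            # The ID is the text between '>' and the first whitespace.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            yield ">" + seq_id + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line


# Made-up example record:
# list(strip_fasta_descriptions([">seq1 some description\n", "ACGT\n"]))
# returns [">seq1\n", "ACGT\n"]
```

The resulting IDs are the names that the annotation file has to match in its first column.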

mrshu commented 6 years ago

Thanks @vidriduch, this should be figured out (at least to some extent) now.

We also have a Wiki page which should make this reproducible: https://github.com/fmfi-genomika/genomikaMalGlo/wiki/Protein-coding-genes-from-NCBI-RefSeq-(malGlo)

I believe the first pass should be done, even with codon highlighting: http://genomika.compbio.fmph.uniba.sk/cgi-bin/hgTracks?db=malGlo1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=NW_001849855.1%3A1200-2200&hgsid=323_Cl6ljmzYCQyGUo4IMAxUzCAq2Av7

@bbrejova I also used the PROTEIN_ID values to name the displayed genes: instead of cds87, a gene would now be called, for instance, XP_001728661.1. I have a few questions here:

Does something like that make sense? I am happy to change it if not.
It seems we do not have any items with type=region in our .gff file. Is that possible, or did I make a mistake somewhere?
Other than that, I believe this task is pretty much done. Could you please confirm that?
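
As an illustration of the naming choice above, the following Python sketch picks a display name from a GFF3 attribute column, preferring the protein ID and falling back to the generic ID. The attribute key protein_id and both function names are assumptions made for this example; the key in the actual file may be spelled differently (e.g. PROTEIN_ID).

```python
def parse_gff3_attributes(column):
    """Parse a GFF3 attribute column of 'key=value' pairs separated by ';'."""
    attributes = {}
    for pair in column.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def display_name(column, preferred_key="protein_id"):
    """Return the preferred attribute if present, otherwise the ID attribute."""
    attributes = parse_gff3_attributes(column)
    return attributes.get(preferred_key) or attributes.get("ID", "")


# Made-up attribute string for illustration:
# display_name("ID=cds0;Parent=rna0;protein_id=XP_000000.1") returns "XP_000000.1"
# display_name("ID=cds0;Parent=rna0") returns "cds0"
```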

Thanks!

bbrejova commented 6 years ago

(1) Protein-coding track as such looks good. It can be now used in track M etc.

(2) The documentation has some minor mistakes, e.g. "Name of my file in gtf format is saccer.gtf" (should be malGlo.gtf?) and "Check whether the PROTEIN_ID is unique per each chromosome name" (should be per transcript_Id?).

(3) The track RefSeq Other (local) unfortunately probably contains only useless information. For example, id123 covers almost the whole contig NW_001849855.1, and I assume that such regions cover whole contigs elsewhere as well. It also contains features called rna90 etc., which are copies of protein-coding genes but without introns; these are not useful either. I have not found any other item type that should be kept in the track, so I suggest deleting this track and noting this clearly in the documentation.
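
One way to check the assumption in (3) across all contigs is to list every feature that spans most of its contig. The Python sketch below assumes a dictionary of contig lengths is available (for example, read from a chrom.sizes file); the function name whole_contig_features and the 0.9 threshold are hypothetical choices for illustration.

```python
def whole_contig_features(gff_lines, contig_sizes, min_fraction=0.9):
    """Return (contig, type, start, end) for features covering most of a contig.

    contig_sizes maps contig name to its length in bases; min_fraction is
    the share of the contig a feature must span to be reported.
    """
    hits = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature line
        contig, feature_type = fields[0], fields[2]
        start, end = int(fields[3]), int(fields[4])
        size = contig_sizes.get(contig)
        if size is None:
            continue  # contig length unknown, cannot judge coverage
        # GFF coordinates are 1-based and inclusive.
        if (end - start + 1) >= min_fraction * size:
            hits.append((contig, feature_type, start, end))
    return hits
```

If every reported item turns out to be a contig-wide region, that supports dropping them from the track.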

mrshu commented 6 years ago

Thanks @bbrejova, all of your comments should be addressed now. I am therefore closing this task. Please feel free to reopen it if you do not agree.

Thanks!

fmfi-genomika / genomikaMalGlo
