fmfi-genomika / genomikaMalGlo

Malassezia globosa

(B) Protein coding genes and other items from the annotation (fast, needs A) #2

Closed mrshu closed 6 years ago

mrshu commented 6 years ago

baseColorUseCds given
baseColorDefault genomicCodons

mrshu commented 6 years ago

@vidriduch There is a rather cryptic bullet point in the description of this task:

Coordinate with renaming chromosomes in step (1)

Can you provide some details on how the chromosomes were renamed in #1? It seems to be fairly important here.

Thanks!

vidriduch commented 6 years ago

@mrshu I believe that when we downloaded the FASTA file, the headers were in the usual FASTA format, ><id> <description> (note the space in between). We just removed the description part. This was done using some browser command: we converted the FASTA to binary with something like a fasta2bin command and then back to FASTA with bin2fasta, I believe. This removed everything after the space (the description part); otherwise the ids were left untouched. This process was suggested by the teachers. The chromosome names (ids) are visible in the browser or in the FASTA file.
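
To illustrate the effect of that round trip, here is a minimal Python sketch that trims each FASTA header down to its id. It is not the browser command we actually ran; the function name `strip_fasta_descriptions`, the description text and the second contig id are made up for the example.

```python
import io


def strip_fasta_descriptions(lines):
    """Yield FASTA lines, keeping only the id (text before the first whitespace) in headers."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # ">id description ..." becomes ">id"; sequence lines pass through unchanged.
            fields = line[1:].split(None, 1)
            yield ">" + (fields[0] if fields else "")
        else:
            yield line


# Hypothetical input: the description text and the second id are invented.
example = io.StringIO(
    ">NW_001849855.1 some description text\n"
    "ACGTACGT\n"
    ">NW_001849856.1\n"
    "GGCC\n"
)
cleaned = list(strip_fasta_descriptions(example))
print("\n".join(cleaned))
```

Headers that already consist of just the id are left as they are, so running this on an already cleaned file should change nothing.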

mrshu commented 6 years ago

Thanks @vidriduch, this should be figured out (at least to some extent) now.

We also have a Wiki page which should make this reproducible: https://github.com/fmfi-genomika/genomikaMalGlo/wiki/Protein-coding-genes-from-NCBI-RefSeq-(malGlo)

I believe the first pass should be done, even with codon highlighting: http://genomika.compbio.fmph.uniba.sk/cgi-bin/hgTracks?db=malGlo1&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=NW_001849855.1%3A1200-2200&hgsid=323_Cl6ljmzYCQyGUo4IMAxUzCAq2Av7

@bbrejova I also used the PROTEIN_ID values to name the displayed genes: instead of cds87 they are now called, for instance, XP_001728661.1. I have a few questions here:

  1. Does something like that make sense? I am happy to change it if not.
  2. It seems we do not have genes with type=region in our .gff file. Is that possible or is it just that I made a mistake somewhere?
  3. Other than that, I believe this task is pretty much done. Could you please confirm that?

Thanks!
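
For reference, the renaming can be sketched roughly as below. This is only an illustration, not the exact procedure from the Wiki page: it assumes GFF3-style `key=value` attributes with a `protein_id` key, the helper names are made up, and the coordinates and the pairing of cds87 with XP_001728661.1 in the sample line are illustrative.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def cds_names(gff_lines):
    """Map each CDS ID to its protein_id, falling back to the ID when none is given."""
    names = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            names[attrs["ID"]] = attrs.get("protein_id", attrs["ID"])
    return names


def shared_protein_ids(names):
    """Return protein ids that are attached to more than one CDS ID."""
    by_protein = defaultdict(set)
    for cds_id, protein_id in names.items():
        by_protein[protein_id].add(cds_id)
    return {p: sorted(ids) for p, ids in by_protein.items() if len(ids) > 1}


# Illustrative record only: coordinates and the ID/protein_id pairing are assumptions.
sample = [
    "##gff-version 3\n",
    "NW_001849855.1\tRefSeq\tCDS\t1200\t1500\t.\t+\t0\t"
    "ID=cds87;Parent=rna90;protein_id=XP_001728661.1\n",
]
names = cds_names(sample)
print(names)
print(shared_protein_ids(names))
```

An empty result from `shared_protein_ids` would suggest that the protein ids are safe to use as unique item names.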

bbrejova commented 6 years ago

(1) Protein-coding track as such looks good. It can be now used in track M etc.

(2) The documentation has some minor mistakes, e.g. "Name of my file in gtf format is saccer.gtf" (should be malGlo.gtf?) and "Check whether the PROTEIN_ID is unique per each chromosome name" (should be per transcript_Id?).

(3) The RefSeq Other (local) track unfortunately seems to contain only useless information. For example, id123 covers almost the whole contig NW_001849855.1, and I assume other contigs have similar regions covering them entirely. It also contains features called rna90 etc., which are copies of protein-coding genes but without introns; these are not useful either. I have not found any other item type that should be kept in the track, so I suggest deleting the track and noting this clearly in the documentation.
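
One quick way to double-check which item types the annotation contains is to tally the third column of the .gff file. The following Python sketch assumes a standard nine-column, tab-separated GFF layout; the function name `feature_type_counts` and the coordinates in the sample lines are invented for illustration.

```python
from collections import Counter


def feature_type_counts(gff_lines):
    """Count how often each feature type (GFF column 3) occurs."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts


# Illustrative lines only: the coordinates are made up.
sample = [
    "##gff-version 3\n",
    "NW_001849855.1\tRefSeq\tregion\t1\t5000\t.\t+\t.\tID=id123\n",
    "NW_001849855.1\tRefSeq\tmRNA\t1200\t2200\t.\t+\t.\tID=rna90\n",
    "NW_001849855.1\tRefSeq\tCDS\t1200\t1500\t.\t+\t0\tID=cds87;Parent=rna90\n",
    "NW_001849855.1\tRefSeq\tCDS\t1600\t2200\t.\t+\t2\tID=cds87;Parent=rna90\n",
]
counts = feature_type_counts(sample)
for feature_type, n in sorted(counts.items()):
    print(feature_type, n)
```

If the real file shows only region, rna-like and CDS-related types, that would support dropping the track.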

mrshu commented 6 years ago

Thanks @bbrejova, all of your comments should be addressed now. I am therefore closing this task. Please feel free to reopen it if you do not agree.

Thanks!