More general methodology question

LehmannN commented 4 years ago

Hello again !

I would like to make sure I am not doing anything wrong in my analysis, so any advice from you would be much appreciated. We have RNA-seq data from chick embryo and are trying to improve the annotation, because we saw that ~40% of the RNA-seq reads were lost during the gene assignment step. We postulated that many of these reads were lost because of poor annotation (we are working on a very specific and rare cell type), which was partially confirmed by inspecting the data in a genome browser. We have both short and long reads for these cells. We used StringTie and Scallop (separately, to make comparisons) to get an improved annotation, and it works nicely (we recovered ~20% of the reads), though we think we have a non-negligible number of artefactual gene fusions. That's why we decided to use Mikado to address this.

Now that I am starting to use Mikado on this data, I have a couple of questions:

1- Regarding the scoring file, I chose mammalian.yaml (the one in Mikado/configuration/scoring_files/HISTORIC) because intron sizes in chick are similar to those in mammals. Do you think that's OK, or are there obvious caveats I should take into account?

2- Regarding the ORF locations, I generated the BED file with TransDecoder. I followed the steps described in their wiki under "Starting from a genome-based transcript structure GTF file (eg. cufflinks or stringtie)" up to TransDecoder.Predict. The last step (generate a genome-based coding region annotation file) does not seem relevant for running Mikado. Does that seem OK to you?

3- More generally, do you have any remarks or comments on the whole process of our analysis?

Thanks a lot for your advice and help!

lucventurini commented 4 years ago

Dear @LehmannN

Now that I am starting to use Mikado on this data, I have a couple of questions: 1- Regarding the scoring file, I chose mammalian.yaml (the one in Mikado/configuration/scoring_files/HISTORIC) because intron sizes in chick are similar to those in mammals. Do you think that's OK, or are there obvious caveats I should take into account?

Honestly, I think it should be fine; @swarbred or @gemygk might have some insight into particular parameters to change. I have never annotated a bird genome, unfortunately, so I am not aware of any particular change that might be required. However, that scoring file should be robust enough for your purposes.
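For anyone who wants to check the intron-size assumption behind the choice of scoring file, here is a minimal sketch that computes intron lengths from the exon records of a GTF (for example, the StringTie or Scallop output). It is not part of Mikado; the function name `intron_lengths` and the demo records are hypothetical, and it assumes a standard 9-column GTF with a `transcript_id` attribute on each exon line.

```python
from collections import defaultdict
from statistics import median


def intron_lengths(gtf_lines):
    """Return intron lengths derived from the exon records of a GTF."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        # Pull transcript_id out of the attribute column.
        tid = None
        for attr in fields[8].split(";"):
            key, _, value = attr.strip().partition(" ")
            if key == "transcript_id":
                tid = value.strip().strip('"')
                break
        if tid is None:
            continue
        exons[tid].append((int(fields[3]), int(fields[4])))

    lengths = []
    for coords in exons.values():
        coords.sort()
        for (_, prev_end), (next_start, _) in zip(coords, coords[1:]):
            # GTF coordinates are 1-based and inclusive.
            lengths.append(next_start - prev_end - 1)
    return lengths


# Hypothetical three-exon transcript, for illustration only.
demo = [
    'chr1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttest\texon\t301\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttest\texon\t1001\t1100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
demo_lengths = intron_lengths(demo)
print(demo_lengths, median(demo_lengths))
```

On real data, one would pass an open file handle instead of `demo` and compare the median and the upper tail of the distribution with the intron-size thresholds in the chosen scoring file.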

2- Regarding the ORF locations, I generated the BED file with TransDecoder. I followed the steps described in their wiki under "Starting from a genome-based transcript structure GTF file (eg. cufflinks or stringtie)" up to TransDecoder.Predict. The last step (generate a genome-based coding region annotation file) does not seem relevant for running Mikado. Does that seem OK to you?

Yes, that's correct. The only thing is - I know it is an obvious remark, but it is a common mistake - please make sure that it is the mikado_prepared.fasta file that was used for TransDecoder.
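One way to catch this mistake early is to confirm that the sequence names in the TransDecoder BED file are transcript IDs present in mikado_prepared.fasta. The sketch below is not a Mikado or TransDecoder utility; the function names and demo records are hypothetical, and it assumes that the first BED column holds the transcript ID and that any header lines start with `track`, `browser`, or `#`.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (first word after '>') from FASTA lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.add(header[0])
    return ids


def bed_ids(bed_lines):
    """Collect the sequence names found in column 1 of BED lines."""
    ids = set()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        ids.add(line.split("\t")[0])
    return ids


def orf_ids_missing_from_fasta(fasta_lines, bed_lines):
    """Return BED sequence names that do not appear in the FASTA."""
    return bed_ids(bed_lines) - fasta_ids(fasta_lines)


# Hypothetical records, for illustration only.
demo_fasta = [">st_tx.1 some description", "ACGTACGT", ">st_tx.2", "GGCCGGCC"]
demo_bed_ok = ["track name=orfs", "st_tx.1\t0\t300\tID=orf1\t0\t+\t10\t250"]
demo_bed_bad = ["chr1\t0\t300\tID=orf1\t0\t+\t10\t250"]

print(orf_ids_missing_from_fasta(demo_fasta, demo_bed_ok))
print(orf_ids_missing_from_fasta(demo_fasta, demo_bed_bad))
```

An empty result means every ORF refers to a transcript in the prepared FASTA. A non-empty result, such as chromosome names or the original assembler IDs, suggests TransDecoder was run on a different sequence file.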

3- More generally, do you have any remarks or comments on the whole process of our analysis?

None so far! Resolving artefactual gene fusions is indeed one of Mikado's strengths.

Let us know how it proceeds!

LehmannN commented 4 years ago

Thanks @lucventurini for your feedback !

Yes, that's correct. The only thing is - I know it is an obvious remark, but it is a common mistake - please make sure that it is the mikado_prepared.fasta file that was used for TransDecoder.

I made that mistake, of course... I thought TransDecoder needed the reference genome FASTA, so thanks for pointing that out!

I did not want to run Daijin assemble because I already have the novel genome annotation files, so I ran Portcullis and TransDecoder separately from the pipeline. Maybe that is not good practice. Is it possible to tell Daijin assemble to start from already built GTF files (without re-aligning and re-assembling the data)?

lucventurini commented 4 years ago

Is it possible to tell Daijin assemble to start from already built GTF files (without re-aligning and re-assembling the data)?

Unfortunately not, those are intermediate steps for the pipeline.

I did not want to run Daijin assemble because I already have the novel genome annotation files, so I ran Portcullis and TransDecoder separately from the pipeline.

Running Portcullis outside the pipeline should be fine. For the Mikado-related steps, I would recommend using daijin mikado instead - it should automate all the necessary steps for you.

LehmannN commented 4 years ago

For the Mikado-related steps, I would recommend using daijin mikado instead - it should automate all the necessary steps for you.

OK, that's what I thought. Thanks a lot for your support! I now have a better idea of the whole pipeline I will use.

lucventurini commented 4 years ago

Closing for now.