ScienceParkStudyGroup / studyGroup

Gather together a group to skill-share, co-work, and create community
https://www.scienceparkstudygroup.info

Custom GTF file indexing #30

Closed Fred-White94 closed 6 years ago

Fred-White94 commented 6 years ago

Hello World,

I have been trying to use kallisto to analyse genes and transposable elements in tandem. This requires creating a custom GTF file to be used for kallisto index. It seems you can merge two GTF files as long as you keep track of the order and the transcript identifiers. You can then use the TopHat command gtf_to_fasta to build the FASTA file for indexing. At this stage, however, gtf_to_fasta reorders many of the transcripts, which sometimes makes the final output quite complicated to decipher in terms of target_id identification.

If anyone knows a better way to create a custom index, rather than just using the single transcriptome files available online, please post it here.

Cheers

mgalland commented 6 years ago

Hello Fred. What I think you should do is:

  1. Get the fasta file for genes (mRNAs) using the 'getfasta' command from bedtools for instance. Find the command link here. You now have a "gene" fasta file.
  2. Do the same for repeats. You get a "repeat" fasta file.
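  3. Combine the two fasta files with something like 'cat genes.fasta repeats.fasta > reference.fasta' (using '>' rather than '>>' so that an existing reference.fasta is overwritten instead of appended to)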
  4. Create the kallisto index with 'kallisto index --index=reference.index reference.fasta'
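
The steps above can be sketched roughly as follows. This is only a minimal sketch: the file names (genome.fa, genes.gtf, repeats.gtf and the toy_* files) are placeholders I made up, and the commands for real data are left as comments because they depend on your genome and annotation. Note that, as far as I know, 'bedtools getfasta' extracts one sequence per feature line, so with a GTF you may get per-exon sequences rather than spliced transcripts. The runnable part uses two tiny toy FASTA files to show the combine step plus a check for duplicated sequence IDs, which would otherwise make the target_id column ambiguous.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Steps 1-2 on real data (hypothetical file names, adjust to your own):
#   bedtools getfasta -fi genome.fa -bed genes.gtf   -fo genes.fasta
#   bedtools getfasta -fi genome.fa -bed repeats.gtf -fo repeats.fasta

# Toy stand-ins for the two FASTA files so the rest of the sketch can run.
cat > toy_genes.fasta <<'EOF'
>gene1_mRNA
ATGGCGTACGTTAGCATGGCGTACGTTAGCATGGCGTACG
>gene2_mRNA
ATGCCGTTAGGCTAAATGCCGTTAGGCTAAATGCCGTTAG
EOF
cat > toy_repeats.fasta <<'EOF'
>TE_copia_1
TTGACCGGTAACGGTTTGACCGGTAACGGTTTGACCGGTA
EOF

# Step 3: combine. '>' overwrites an existing file; '>>' would append to it.
cat toy_genes.fasta toy_repeats.fasta > toy_reference.fasta

# Sanity check: every sequence ID (header up to the first space) must be unique.
n_records="$(grep -c '^>' toy_reference.fasta)"
dup_ids="$(grep '^>' toy_reference.fasta | cut -d' ' -f1 | sort | uniq -d)"
if [ -n "$dup_ids" ]; then
  echo "Duplicate sequence IDs found:" >&2
  echo "$dup_ids" >&2
  exit 1
fi
echo "Combined ${n_records} records with unique IDs"

# Step 4 on real data:
#   kallisto index --index=reference.index reference.fasta
```

If the duplicate check fails, renaming the clashing headers before indexing (for example by adding a gene_ or repeat_ prefix) should keep the two sets apart.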

Let me know how it goes!

Fred-White94 commented 6 years ago

Hi Marc,

This definitely works on a sampled dataset. I haven't had time to scale up yet.

I would definitely recommend combining transcripts at the FASTA file stage rather than as a GTF, as merging at the GTF stage seems to cause problems when converting to FASTA if there are transcripts that share a similar genomic location.
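
One way to keep target_id identification simple when combining at the FASTA stage is to record which file each sequence ID came from before concatenating. The sketch below is an illustration only: the toy_* files, the label_ids helper and target_classes.tsv are names invented for this example, and it assumes that kallisto reports the header text up to the first whitespace as the target_id, which is worth checking against your own abundance.tsv.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy FASTA files (hypothetical IDs and coordinates in the headers).
cat > toy_genes.fasta <<'EOF'
>gene1_mRNA chr1:100-140
ATGGCGTACGTTAGCATGGCGTACGTTAGCATGGCGTACG
>gene2_mRNA chr1:500-540
ATGCCGTTAGGCTAAATGCCGTTAGGCTAAATGCCGTTAG
EOF
cat > toy_repeats.fasta <<'EOF'
>TE_copia_1 chr1:120-160
TTGACCGGTAACGGTTTGACCGGTAACGGTTTGACCGGTA
EOF

# Print "ID<TAB>class" for every record in a FASTA file.
#   $1 = FASTA file, $2 = class label (e.g. gene or repeat)
label_ids() {
  grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1 \
    | awk -v cls="$2" '{ print $0 "\t" cls }'
}

# Build a lookup table that can later be joined to the kallisto output.
{
  printf 'target_id\tclass\n'
  label_ids toy_genes.fasta gene
  label_ids toy_repeats.fasta repeat
} > target_classes.tsv

cat target_classes.tsv
```

The resulting table can then be joined to the target_id column of the quantification output, so overlapping genes and repeats stay distinguishable regardless of their order in the combined FASTA.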

Thanks again