GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

tama_merge only some chromosomes output to bed #69

Closed rob123king closed 2 years ago

rob123king commented 2 years ago

I'm running the command below and only some of the chromosomes appear in the output BED; the others are skipped. I had gone all the way to the end, including adding CDS, before I realised the output only has chr1 and chr10-16. Could a sorting issue mean that only some of the entries are parsed?

tama_merge.py -f file.txt -p mergeset2

refset.bed  capped  1,1,1   ref
bothDNARNA.bed  no_cap  2,1,1   nanopore
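
A quick way to confirm which chromosomes are being dropped is to compare the chromosome names in the input BED files with those in the merged output. The sketch below is only an illustration: it assumes the merged file is named `mergeset2.bed` (the `-p` prefix plus `.bed`), and the function names `chrom_counts` and `missing_chroms` are made up for this example rather than being part of TAMA.

```python
from collections import Counter
from pathlib import Path


def chrom_counts(bed_path):
    """Count BED records per chromosome (first column), skipping blank and header lines."""
    counts = Counter()
    with open(bed_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track ", "browser ")):
                continue
            counts[line.split()[0]] += 1
    return counts


def missing_chroms(input_beds, merged_bed):
    """Return chromosomes present in any input BED but absent from the merged BED."""
    seen = set()
    for bed in input_beds:
        seen.update(chrom_counts(bed))
    return sorted(seen - set(chrom_counts(merged_bed)))


if __name__ == "__main__":
    inputs = ["refset.bed", "bothDNARNA.bed"]
    merged = "mergeset2.bed"  # assumption: output prefix from -p plus ".bed"
    if all(Path(p).exists() for p in inputs + [merged]):
        print("Chromosomes missing from merge:", missing_chroms(inputs, merged))
```

If the list is not empty, the chromosomes were present in the inputs and lost during the merge itself, which would point to an incomplete run rather than to the input data.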

GenomeRIK commented 2 years ago

Hello,

Are you sure the TAMA Merge completed? Did you get the success message at the end?

Cheers, Richard

rob123king commented 2 years ago

I reran it and seemed to get complete output this time, but there is not much change, and the models that are different tend to look like incomplete or dodgy additions. Using Nanopore data, maybe just one isoform is all I have, but that doesn't seem right. It's the same with PASA, so it might be something with the data or the parameters. I'll try to pick this up again and work something out.

GenomeRIK commented 2 years ago

Oh sorry I just realized you are using Nanopore data. Did you use TAMA Collapse? There are special parameters I use with TAMA Collapse for Nanopore which may affect this.

rob123king commented 2 years ago

I used TAMA Collapse. What are your special parameters for Nanopore?

GenomeRIK commented 2 years ago

I would recommend using the following:

-a 100 -z 100 -i 85 -c 80 -x no_cap -sj sj_priority -sjt 10 -lde 5 -icm ident_map
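
As a rough illustration of where these flags go, the sketch below builds a full TAMA Collapse command line in Python. The nine flags and their values are copied from the recommendation above. The `-s` (SAM), `-f` (genome FASTA) and `-p` (output prefix) arguments are from my recollection of the TAMA Collapse usage and should be checked against `tama_collapse.py -h`. The file names and the helper `build_collapse_command` are hypothetical placeholders.

```python
import shlex

# Flags and values exactly as recommended above for Nanopore reads.
NANOPORE_FLAGS = (
    "-a 100 -z 100 -i 85 -c 80 -x no_cap "
    "-sj sj_priority -sjt 10 -lde 5 -icm ident_map"
)


def build_collapse_command(sam, genome_fasta, prefix, script="tama_collapse.py"):
    """Return the argument list for a TAMA Collapse run with the Nanopore settings."""
    # Assumption: -s, -f and -p are the input SAM, genome FASTA and output prefix.
    base = ["python", script, "-s", sam, "-f", genome_fasta, "-p", prefix]
    return base + shlex.split(NANOPORE_FLAGS)


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own sorted SAM and genome.
    cmd = build_collapse_command("reads.sorted.sam", "genome.fa", "nanopore_collapse")
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Printing the command before running it makes it easy to paste into a log, so the exact settings used for each collapse run are recorded.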

rob123king commented 2 years ago

It's better with the changed settings; there now seem to be some models that are different after TAMA Merge. However, there is a lot of junk in the BED file: transcripts in the wrong orientation, and fragments. I tried the -cds and -s options using the name of my reference BED file ("refset.bed" and also "refset"), but it didn't find that name in the file to carry over the reference gene names. I'm missing something there. What would you suggest as next steps? Is it as below, or should I change the settings to remove some of these artefact transcript annotations before the merge command?

  1. To add CDS: TAMA GO: ORF and NMD predictions
  2. Convert to GFF: TAMA GO: Formatting
  3. Then remove all transcript models without a start and stop?

GenomeRIK commented 2 years ago

Before you go to downstream analysis, it seems like you might have a major problem upstream. If you are getting a lot of transcripts oriented in the wrong direction and fragments then it sounds like maybe something happened during adapter cleanup. Did you use Pychopper? If so, did you use something like Seqkit to remove low quality reads before running Pychopper?

What you are seeing sounds like what happens when low quality reads are included in a Pychopper run. Pychopper builds a model of the adapters from your input reads, so if a high percentage of those reads are low quality, it builds a bad model and identifies adapters where there aren't any. It then uses its adapter loci predictions to orient and trim the reads, so if this is not done properly you end up with a lot of truncated and mis-oriented reads.

rob123king commented 2 years ago

I didn't do any cleanup. I've just used what the sequencing supplier sent me, as output by Guppy. I have a direct pooled RNA sample as well as individual tissue cDNA data sets, but I'm using the cDNA set.

I'll look at Seqkit, then Pychopper, and try again.

GenomeRIK commented 2 years ago

Ok there is some info that will be useful for your data processing decisions (some of which you may want to get from the sequencing provider):

  1. What flowcell was used?
  2. What cDNA synthesis kit was used?
  3. What are the adapter sequences for the cDNA libraries?
  4. What Nanopore sequencing kit was used?
  5. What version of Guppy was used for basecalling?
  6. Which config file was used by guppy for basecalling?
  7. Was there anything done to the reads after basecalling?

And some questions for you:

  1. What mapper did you use to align the reads to your reference genome?
  2. What parameters did you use?

To get the best results you should have all that information. If you cannot get some of that information then you can either go for suboptimal processing or do some detective work on your data.

Cheers, Richard

rob123king commented 2 years ago

Thanks, I'll chase that. I have the raw data from before Guppy, so I can start again, but I'll see what was done first. Applying Pychopper to the files I have, 75% is unclassified, which seems too high.

rob123king commented 2 years ago

Got the info. I'll start from raw and process again, then follow with pychopper and see if that gives better data. New guppy release today too.

  1. What flowcell was used? FLO-PRO002
  2. What cDNA synthesis kit was used? DCS109 (Direct-cDNA kit) plus EXP-NBD104 (Barcoding kit)
  3. What are the adapter sequences for the cDNA libraries? Barcodes NB07 – NB12 (barcode sequences are on page 20 of the attached document)
  4. What Nanopore sequencing kit was used? DCS109 (Direct-cDNA kit) plus EXP-NBD104 (Barcoding kit)
  5. What version of Guppy was used for basecalling? 5.0.17
  6. Which config file was used by guppy for basecalling? dna_r9.4.1_450bps_hac_prom.cfg
  7. Was there anything done to the reads after basecalling? trim_barcodes="on"

GenomeRIK commented 2 years ago

Great that you got all that info!

Just remember to use Seqkit to filter out reads below Q7 before running Pychopper. Also make sure you are giving Pychopper the right primer set, in the right orientation and with the right naming.

You could also skip the pychopper step and map without strand knowledge but any reads with long poly A tails may not map well.
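
For reference, the Q7 pre-filter mentioned above can be sketched in plain Python. This is a stand-in for illustration, not Seqkit itself: I believe `seqkit seq` offers a minimum average quality option (`--min-qual`), but check `seqkit seq --help` before relying on that. The sketch assumes 4-line FASTQ records with Phred+33 qualities and takes the mean quality from averaged per-base error probabilities, which may differ slightly from how Seqkit computes it. The function names are hypothetical.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality computed from the averaged per-base error probabilities."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_fastq(in_handle, out_handle, min_q=7.0):
    """Copy 4-line FASTQ records with mean quality >= min_q; return (kept, total)."""
    kept = total = 0
    while True:
        record = [in_handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        total += 1
        qual = record[3].rstrip("\r\n")
        if qual and mean_phred(qual) >= min_q:
            out_handle.writelines(record)
            kept += 1
    return kept, total
```

A typical use would be to open the basecalled FASTQ and an output file, call `filter_fastq(src, dst, min_q=7.0)`, and check the returned counts to see what fraction of reads was discarded before running Pychopper.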

GenomeRIK commented 2 years ago

Oh yeah and I forgot to say that Pychopper does not clean up poly-A tails (at least from my last reading) so you will want to remove the poly-A tails after pychopper. You can use "tama_flnc_polya_cleanup.py" for this.

https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Sequence-Cleanup
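
To make the poly-A step concrete, here is a minimal sketch of what trimming a 3' poly-A tail means. It is not the algorithm used by "tama_flnc_polya_cleanup.py", which I have not reproduced here. It assumes reads are already oriented so that the tail sits at the 3' end, and it only removes an uninterrupted run of A's. The function name `trim_polya` and the default `min_len=10` are arbitrary choices for this example.

```python
import re


def trim_polya(seq, min_len=10):
    """Remove a trailing run of at least min_len A's; return seq unchanged otherwise."""
    # Anchored at the end of the read, so internal A-rich stretches are left alone.
    match = re.search(r"A{%d,}$" % min_len, seq, flags=re.IGNORECASE)
    return seq[: match.start()] if match else seq


if __name__ == "__main__":
    read = "ACGTACGTTG" + "A" * 25
    print(trim_polya(read))
```

Real tails often contain sequencing errors, so a production tool needs to tolerate mismatches inside the tail; use the TAMA script linked above for actual data.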