ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/

How do I specify FLNC reads in IsoQuant #226

Open sanyalab opened 2 months ago

sanyalab commented 2 months ago

Hi,

I have PacBio FLNC reads in FASTQ format. What options should be specified when running the tool? I was thinking --data_type pacbio --fl_data. Is this correct?

Thanks Abhijit

andrewprzh commented 2 months ago

Dear @sanyalab

Yes, this set of options is correct.
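For reference, here is a minimal sketch of how such a run could be assembled from Python. Only --data_type pacbio and --fl_data come from this thread. The file names and output folder are placeholders, and the isoquant.py entry point with --reference, --fastq and -o is assumed from the IsoQuant documentation, so check it against your installed version.

```python
import shlex


def build_isoquant_command(reference, fastq, output_dir,
                           data_type="pacbio", full_length=True):
    """Assemble an IsoQuant command line as a list of arguments.

    All paths are hypothetical placeholders supplied by the caller.
    """
    cmd = [
        "isoquant.py",
        "--reference", reference,   # genome FASTA (placeholder path)
        "--fastq", fastq,           # FLNC reads in FASTQ format
        "--data_type", data_type,   # option discussed in this thread
        "-o", output_dir,
    ]
    if full_length:
        cmd.append("--fl_data")     # reads are full-length (FLNC)
    return cmd


cmd = build_isoquant_command("genome.fasta", "flnc.fastq", "isoquant_out")
# Print a shell-ready string; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems when it is later passed to subprocess.run.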

Best Andrey

sanyalab commented 2 months ago

Hi Andrey,

A few other questions:

  1. Are you guys sure there is no difference between pacbio and pacbio_ccs? I used 1.2 million IsoSeq FLNC reads and got 7780 transcripts for pacbio_ccs and 5438 for pacbio. I am using the 3.5 version.
  2. What is the difference among default_pacbio, sensitive_pacbio, and fl_pacbio other than the transcript number?
  3. I am working with a fungal genome (<100MB) in a contig state that has 2 haplotypes. 2a. Should I concatenate the haplotype genomes and use them together for IsoQuant, or use them separately as I have done above? 2b. Does this decrease (1.2 million reads to ~8000 transcripts) seem reasonable for a fungal genome? Any suggestions on the optimal number of reads (genome agnostic) for IsoQuant?

Thanks Abhijit

andrewprzh commented 2 months ago

@sanyalab

> Are you guys sure there is no difference between pacbio and pacbio_ccs? I used 1.2 million IsoSeq FLNC reads and got 7780 transcripts for pacbio_ccs and 5438 for pacbio. I am using the 3.5 version.

Yes, they are just aliases. Could you send me the logs for these runs?

> What is the difference among default_pacbio, sensitive_pacbio, and fl_pacbio other than the transcript number?

These are just different option presets. sensitive_pacbio applies slightly lighter filters than default_pacbio. fl_pacbio reports a known transcript only if it is covered by FSM (full splice match) reads. From the user's perspective, the only difference is the number of reported transcripts.
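Since the presets differ only in how many transcripts they report, one way to compare them is to run each preset into its own output folder and count the transcript records in the resulting GTF files. The sketch below is illustrative only: the preset names come from this thread, the --model_construction_strategy flag is assumed from the IsoQuant documentation, and all paths and the example GTF lines are made up.

```python
PRESETS = ["default_pacbio", "sensitive_pacbio", "fl_pacbio"]


def commands_per_preset(reference, fastq, out_root):
    """Return {preset: argument list}, one IsoQuant run per preset."""
    return {
        preset: [
            "isoquant.py",
            "--reference", reference,
            "--fastq", fastq,
            "--data_type", "pacbio",
            "--model_construction_strategy", preset,  # assumed flag name
            "-o", f"{out_root}/{preset}",             # separate folder per run
        ]
        for preset in PRESETS
    }


def count_transcripts(gtf_lines):
    """Count GTF records whose feature column (3rd field) is 'transcript'."""
    count = 0
    for line in gtf_lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "transcript":
            count += 1
    return count


# Made-up GTF lines, only to show the counting logic.
example_gtf = [
    "# example header",
    "contig1\tIsoQuant\tgene\t1\t900\t.\t+\t.\tgene_id \"g1\";",
    "contig1\tIsoQuant\ttranscript\t1\t900\t.\t+\t.\tgene_id \"g1\";",
    "contig1\tIsoQuant\texon\t1\t400\t.\t+\t.\tgene_id \"g1\";",
    "contig1\tIsoQuant\ttranscript\t1\t700\t.\t+\t.\tgene_id \"g1\";",
]
runs = commands_per_preset("genome.fasta", "flnc.fastq", "presets")
n_transcripts = count_transcripts(example_gtf)
```

In practice, count_transcripts would be given an open file handle for the transcript model GTF written by each run.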

> I am working with a fungal genome (<100MB) in a contig state that has 2 haplotypes. 2a. Should I concatenate the haplotype genomes and use them together for IsoQuant, or use them separately as I have done above?

I have very little experience with diploid genomes, especially highly diploid ones. I would first try to create a consensus genome, if that is even possible. If not, using the haplotypes separately could be better, since there can be far too many multimapped reads when using a concatenated genome.

> 2b. Does this decrease (1.2 million reads to ~8000 transcripts) seem reasonable for a fungal genome? Any suggestions on the optimal number of reads (genome agnostic) for IsoQuant?

It's very hard to predict how many novel transcripts should be detected and what a reasonable number is. It depends on the genome itself, how well it is sequenced, how deep your sequencing is, etc. So the only suggestion I can give is to check related genomes or try different settings / tools and compare the output.

Best Andrey

sanyalab commented 2 months ago

Hi Andrey,

I'll generate the files again. Since it was a test and I was playing with the hyperparameters, I didn't know what to retain. No worries, I'll generate the files and send you the logs. Thank you for the insightful comments.

-Abhijit

sanyalab commented 4 days ago

Hi Andrey,

When I use FLNC reads, the run_log contains the following statements:

2024-10-28 10:17:13,058 - INFO - Total assignments used for analysis: 38576426, polyA tail detected in 40436 (0.1%)
2024-10-28 10:17:13,058 - WARNING - PolyA percentage is suspiciously low. IsoQuant expects non-polya-trimmed reads. If you aim to construct transcript models, consider using --polya_requirement option.
2024-10-28 10:17:13,058 - INFO - Processing assigned reads OUT
2024-10-28 10:17:13,058 - INFO - Transcript models construction is turned on
2024-10-28 10:17:13,078 - INFO - Transcript construction options:
2024-10-28 10:17:13,078 - INFO -   Novel monoexonic transcripts will be reported: yes
2024-10-28 10:17:13,078 - INFO -   PolyA tails are required for multi-exon transcripts to be reported: no
2024-10-28 10:17:13,078 - INFO -   PolyA tails are required for 2-exon transcripts to be reported: no
2024-10-28 10:17:13,078 - INFO -   PolyA tails are required for known monoexon transcripts to be reported: no
2024-10-28 10:17:13,078 - INFO -   PolyA tails are required for novel monoexon transcripts to be reported: yes
2024-10-28 10:17:13,078 - INFO -   Splice site reporting level: only_stranded

I am using FLNC reads without polyA tails. In fact, the IsoSeq3 pipeline generates FLNC reads and trims polyA tails in the same step.
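The percentage in the warning is simply the ratio of the two counts on the INFO line above it. A small sketch that extracts it from a log line, assuming the line format shown above stays the same (the regular expression and function name are illustrative and not part of IsoQuant):

```python
import re

# Matches the two counts on the "Total assignments" log line shown above.
POLYA_RE = re.compile(
    r"Total assignments used for analysis: (\d+), "
    r"polyA tail detected in (\d+)"
)


def polya_fraction(log_line):
    """Return the fraction of assignments with a polyA tail, or None
    if the line does not match the expected format."""
    match = POLYA_RE.search(log_line)
    if match is None:
        return None
    total, with_polya = int(match.group(1)), int(match.group(2))
    return with_polya / total if total else 0.0


line = (
    "2024-10-28 10:17:13,058 - INFO - Total assignments used for analysis: "
    "38576426, polyA tail detected in 40436 (0.1%)"
)
frac = polya_fraction(line)
print(f"{frac:.2%}")  # 40436 / 38576426, i.e. about 0.10%
```

A value this low is what one would expect from polyA-trimmed FLNC reads.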

  1. Do you suggest I use the reads from one step before FLNC? At that stage they are full-length only, not yet non-chimeric. Or:
  2. Do you suggest I artificially add polyA tails to the ends of the FLNC reads?
  3. My aim is to create transcript models. I do not always have a genome annotation to guide the process. The above run_log is from a run where the genome annotation was available, so the FLNC reads were probably fine. Without a genome annotation, how do I proceed? Are FLNC reads sufficient, or are FL reads with polyA tails required?

Thanks Abhijit

andrewprzh commented 4 days ago

Dear @sanyalab

In general, IsoQuant does benefit from the presence of polyA tails, but they are not mandatory. PolyA tails usually help to detect 3' ends more precisely.

I'd suggest trying approach 1 first and seeing how IsoQuant deals with this kind of data. I presume it should be quite safe, since IsoQuant is designed to work with raw ONT and PacBio CCS reads without any processing. With FLNC reads you can also use the --fl_data flag without adding polyA tails, but the second strategy can be an option too.
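If the second strategy (artificial polyA tails) is tried, a rough sketch could look like the following. It rests on several assumptions that are not confirmed in this thread: the FASTQ uses plain 4-line records, every read is already in transcript (5' to 3') orientation so the tail belongs at the end of the sequence, and the tail length of 30 and the quality character are arbitrary choices. Whether IsoQuant treats such synthetic tails like real ones is also untested here.

```python
def append_polya(fastq_lines, tail_len=30, qual_char="I"):
    """Yield FASTQ lines with `tail_len` A's appended to every read.

    Assumes 4-line FASTQ records and reads oriented 5'->3' on the
    transcript strand. `tail_len` and `qual_char` are arbitrary.
    """
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        position = i % 4                  # 0: header, 1: seq, 2: '+', 3: qual
        if position == 1:
            line += "A" * tail_len        # extend the sequence
        elif position == 3:
            line += qual_char * tail_len  # keep quality length in sync
        yield line


# Made-up single-read example.
record = ["@read1", "ACGT", "+", "IIII"]
tailed = list(append_polya(record, tail_len=5))
```

The sequence and quality strings are extended by the same length so the output remains valid FASTQ.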

In the future we also plan to support other IsoSeq files that have information about polyA tails.

Best Andrey