BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA
Other
203 stars 69 forks source link

Issues regarding flair #1

Closed ashokpatowary closed 5 years ago

ashokpatowary commented 5 years ago

Hi @anbrooks @belgravia

I tried to run the flair pipeline on my MinION RNAseq dataset. I have faces some issues.

  1. while trying align I have came across the following error

[ERROR] unknown preset ' splice '

So I mapped the reads independently with minimap2. Thereafter while using the collapse options python /u/home/Resource/flair/flair.py collapse -r test.fastq -g GRCh37.primary_assembly.genome.fa -q test_strand_corrected.psl -m minimap2 -f GenCode_v28.gtf, it gives me a blank output file. Also I tried using psl_to_sequence.py tools separately, it gives a blank output file.

Am I doing anything wrong?

belgravia commented 5 years ago

I am taking a guess that what's causing psl_to_sequence.py to fail is an inability to match the chromosome names in the reference genome fasta with the names in the psl. I have pushed a new version of bin/psl_to_sequence.py that may help. Otherwise, I'd need to take a quick look at what your files look like.

In regards to the minimap2 usage error, that's been fixed with the newest flair.py wrapper script I just pushed, thank you for notifying me of that. That step is definitely fine to do outside of the wrapper script, as long as you convert to psl moving forward.

ashokpatowary commented 5 years ago

Hi @belgravia

Thanks for your quick response. I will try the updated one now. Just one more query. While running the last version, I was having 723040 entries in my input .psl, 722688 entries in _strand.psl; 129339 entries in corrected.gp; and only 10941 entries in *_strand_corrected.psl. Does it looks ok?

Thanks

belgravia commented 5 years ago

There is an expected drop in entries after correction since any reads that contain a splice junction not conforming to GT-AG splice motif and isn't within window of an annotated GT-AG is removed. A >40% drop in entries is larger than what I've normally seen.

If *_strand_corrected.psl is your output file name from the flair correct step, then what may be happening is: 1) Many entries are being filtered out for uncorrectable splice sites (--> corrected.gp) 2) More entries are being filtered out for containing novel, but valid sites (these filtering steps produce strand_corrected.psl)

Potential fixes are: 1a) You could increase the window size (-w) or use a more comprehensive annotation 1b) Your genepred genome annotation may be misformatted for how FLAIR expects it. Have you compared with the genepred in the example files yet? 2a) May be fixed with a better annotation file (1b) 2b) If you choose to move forward with novel, valid splice sites you can now specify just -n in the correct command and those will be retained (please update your flair.py script in that case)

And there should be little to no drop for the first strand/gene inference step. To address that, I'd have to look at a the missing entries since it's unclear to me what could be causing that.

ashokpatowary commented 5 years ago

Hi @belgravia

I am still having issues with running it. Also I tried to check the intermidiate files. Run4_strand_corrected_novels.collapse1.psl is a blank file

Inferring strand for reads in PSL Correcting reads in Run4_strand.psl chrY chrX chr13 chr12 chr11 chr10 chr17 chr16 chr15 chr14 chr19 chr18 chrM chr22 chr20 chr21 chr7 chr6 chr5 chr4 chr3 chr2 chr1 chr9 chr8 Converting output gp to PSL Collapsing isoforms Read data extracted, collapsing Traceback (most recent call last): File "/u/home/flair/bin/collapse_isoforms_precise.py", line 303, in ends = find_best_sites(isoforms[chrom][jset]['tss'], isoforms[chrom][jset]['tss_tes'], chrom) File "/u/home/Resource/flair/bin/collapse_isoforms_precise.py", line 124, in find_best_sites found_tss = find_tsss(sites_tss_all, finding_tss=True, max_results=max_results, chrom=chrom) File "/u/home/Resource/flair/bin/collapse_isoforms_precise.py", line 105, in find_tsss if t in annotends[chrom] and t - bestsite[0] < closest_annotated - bestsite[0]: KeyError: 'chrY' Filtering isoforms Renaming isoforms Aligning reads to first-pass isoform reference [M::mm_idx_gen::0.0017.01] collected minimizers [M::mm_idx_gen::0.0035.19] sorted minimizers [M::main::0.0035.06] loaded/built the index for 0 target sequence(s) [M::mm_mapopt_update::0.0034.92] mid_occ = 0 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0 [M::mm_idx_stat::0.0034.82] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan [M::worker_pipeline::23.1350.76] mapped 453269 sequences [M::main] Version: 2.12-r847-dirty [M::main] CMD: minimap2 -a -t 4 --secondary=no /u/home/MinION/Run_04_Part/correction_output/Run4_strand_corrected_novels.collapse1.fa /u/project/ /nanopore/test.fastq [M::main] Real time: 23.195 sec; CPU: 17.540 sec; Peak RSS: 0.845 GB Counting isoform expression Filtering isoforms by read coverage Removing intermediate files/done!

belgravia commented 5 years ago

After the FLAIR v1..0 release, I had pushed an updated version of collapse_isoforms_precise.py that takes an optional gtf that also contained the bug that you found. I have fixed it now, so you can either use the most recent version (commit aa25f2c) or just use the version in the v1.0 release (no gtf support).

On another note, I hope you have fixed your issue with the loss of reads after novel junction filtering!

ashokpatowary commented 5 years ago

@belgravia

It working perfect with updates. -novel option solve the issues related to loss of reads. Thank you for your help. I am having one last query. During collapse the minimap2 command use was CMD: minimap2 -a -t 4 --secondary=no /u/home/MinION/Run_04_Part/correction_output/Run4_strand_corrected_novels.collapse1.fa /u/project/nanopore/big.fastq

Are you considering the collapse.fa as the ref genome?

Regards

belgravia commented 5 years ago

I'm glad it's working for you now. Yes, that step is aligning the reads directly to the first-pass isoforms to assign an isoform to each read. Isoforms that surpass some number threshold of reads supporting them are retained in the final isoform output.

ashokpatowary commented 5 years ago

@belgravia

Thanks for the explanation and also for this nice tools. It makes life easy.

ashokpatowary commented 5 years ago

Hi Alison,

I have a query indirectly related to flair script. In the flair output, we get the fasta files on the NOVEL isoforms. Is there any way we can have the fasta files of the KNOWN isoforms of the flair-identified NOVEL isoforms from the input file?

Thanks and regards

On Thu, Oct 18, 2018 at 12:36 PM Alison notifications@github.com wrote:

I'm glad it's working for you now. Yes, that step is aligning the reads directly to the first-pass isoforms to assign an isoform to each read. Isoforms that surpass some number threshold of reads supporting them are retained in the final isoform output.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/1#issuecomment-431133540, or mute the thread https://github.com/notifications/unsubscribe-auth/AopMlU0aleGY9QBh482riCA5rwUs6SMBks5umNgggaJpZM4Xc6oZ .

belgravia commented 5 years ago

The fasta files should include all isoforms, including the annotated ones. If you run bin/identify_similar_annotated_isoforms.py on your final isoform psl, it should rename the isoforms that are annotated to their names in the annotation. This script is already run in the flair.py wrapper.

So for human data, the annotated isoforms should be renamed to 'ENST....' and the novel ones should have the nanopore read name as their name.

Also I'm going to be pushing a faster version of this script and others quite soon, if you can wait a little bit.

ashokpatowary commented 5 years ago

Hi Alison,

Thanks for your response. I can see and updated version in flair. Is this the version you were talking about?

Regards Ashok Patowary, PhD University of California Los Angeles 635 Charles E Young Dr S Los Angeles CA 90095 http://www.ashokpatowary.com/

On Thu, Nov 29, 2018 at 4:24 PM Alison notifications@github.com wrote:

The fasta files should include all isoforms, including the annotated ones. If you run bin/identify_similar_annotated_isoforms.py on your final isoform psl, it should rename the isoforms that are annotated to their names in the annotation. This script is already run in the flair.py wrapper.

So for human data, the annotated isoforms should be renamed to 'ENST....' and the novel ones should have the nanopore read name as their name.

Also I'm going to be pushing a faster version of this script and others quite soon, if you can wait a little bit.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/1#issuecomment-443044459, or mute the thread https://github.com/notifications/unsubscribe-auth/AopMld6dAYxN0QZUtc7LNAvKCd4resrEks5u0HqggaJpZM4Xc6oZ .

belgravia commented 5 years ago

Sorry for delay in response! Yes, that script is correct. I combined scripts and there's now a script called bin/identify_gene_isoform.py. You can run it with a command like python identify_gene_isoform.py yourisoforms.psl annotation.gtf yourisoforms_named.psl. The script will output a file using the name of the last argument with the name field rewritten with ensemble gene IDs and transcript IDs. Isoforms that do not have a transcript ID would be novel isoforms.