BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA
Other
205 stars 71 forks source link

Error running flair collapse with both --annotation_reliant and --check_splice #221

Closed cafelton closed 1 year ago

cafelton commented 1 year ago

Installed flair via pip3, running it from the pip installed directory

python3 /private/home/cafelton/.local/lib/python3.6/site-packages/flair/flair.py collapse -g /private/groups/brookslab/reference_sequence/GRCh38.primary_assembly.genome.fa -f /private/groups/brookslab/reference_annotations/gencode.v38.annotation.gtf --annotation_reliant /private/groups/brookslab/reference_sequence/gencode.v38.transcripts.fa --stringent --check_splice -r DRR059313.fastq -q test-new-flair-correct_all_corrected.bed -o test-new-flair-collapse

Writing temporary files to /scratch/tmp/tmp3692oi5i/
Making transcript fasta using annotated gtf and genome sequence Aligning reads to reference transcripts Counting supporting reads for annotated transcripts multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker result = (True, func(*args, *kwds)) File "/usr/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "/private/home/cafelton/.local/lib/python3.6/site-packages/flair/count_sam_transcripts.py", line 203, in count_transcripts_for_reads if relstart+pos in splice_sites[t] or relstart+pos+1 in splice_sites[t] \ KeyError: 'ENST00000275493.7' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/private/home/cafelton/.local/lib/python3.6/site-packages/flair/count_sam_transcripts.py", line 363, in counts = p.map(count_transcripts_for_reads, grouped_reads) File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get raise self._value KeyError: 'ENST00000275493.7' Failed at counting step for annotated isoform read support

It seems like on line 615, flair.py calls gtf_to_psl.py with the --include_gene option, which messes up the transcript names in the annotation. The problem is that the names in the annotated_transcripts.bed file (ENST00000640748.1_ENSG00000125991.19) are being used as keys in a dictionary that is then referenced using just transcript names (ENST00000640748.1), so they don’t match up. Removing that option in the flair.py file seems to fix the problem and not introduce any new ones.

Jeltje commented 1 year ago

Thanks for this detailed description of the problem.

We cannot simply remove the --include_gene option in the code because that breaks a lot of other stuff.

The problem appears to be that the gencode.v38.transcripts.fa file does not contain transcript IDs in the format flair expects. They look like this:

>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|

If we run flair's gtf_to_psl and then psl_to_sequence the resulting file only contains the transcript IDs. I emulated that by running sed 's/|.*//' on the file and using that as input instead. I used the command you listed with this new file, and using a version of flair in which issue #222 is fixed.

Can you have a look at the /private/groups/brookslab/jeltje/test-n files and tell me if that looks like the output you expect? If so I will create better error messaging for this particular issue and commit the code.

cafelton commented 1 year ago

Yes, this output looks right. Better error messaging would be good, but so would adding into the documentation that the annotation_reliant generate option is highly recommended. The transcripts.fa we were using was the official gencode one, so if flair can't work with that, it probably needs to work exclusively with transcripts.fa files that it generates.

Jeltje commented 1 year ago

That's a really good point, I will do that.