GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

Can I use FLAIR corrected alignment with tama collapse? #36

Closed HegedusB closed 3 years ago

HegedusB commented 3 years ago

Hi Rik, I have been using TAMA for a while and I can say it is great software. This message is just a question, not an issue report. Out of curiosity, I tried to collapse an alignment corrected by FLAIR with TAMA. I wanted to try FLAIR because it is able to correct misaligned splice sites. FLAIR produces an alignment BED12 file, which I converted into a BAM file. Then I used this BAM file as input for TAMA Collapse. The first few lines from the BAM file look like this:

67:553|1f202b78-ee34-44f4-ba8c-73d7edabbef5;16 16 scaffold_1 1424 255 388M55N68M55N34M * 0 0 * *
66:1618|eab3ca4a-40c9-45d4-bda4-019f0b265ff4;16 16 scaffold_1 1427 255 385M55N68M55N232M50N152M56N311M50N186M57N112M52N30M62N75M * 0 0 * *
62:1705|d776b5ad-11a1-4143-ba20-e4d0f549b7ca;16 16 scaffold_1 1441 255 371M55N68M55N232M50N152M56N311M50N186M57N112M52N30M62N224M * 0 0 * *
79:305|974bfdb6-1267-43d7-9ad2-06f156c16a10;16 16 scaffold_1 9731 255 195M * 0 0 * *
62:2408|75a2bdf7-ad07-4d10-8481-78df0ba3bca8;0 0 scaffold_1 11127 255 793M100N89M56N43M76N12M89N523M48N56M52N139M57N308M82N195M61N161M * 0 0 * *

It is similar to the deSALT aligner output, except that it does not contain the read sequence. Do you think I can use TAMA for this type of data? When I tried, I got this error message: "Genome seq is not the same length as query seq". Thanks for your suggestions. Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

Thank you for your support!

I think for your application it would make more sense to run TAMA Merge using the BED12 output from FLAIR.

Could you tell me what you want to get out of running TAMA Collapse on the output? Is it just to get the non-redundant models? Or do you want other information as well?

If I can understand the purpose, then I could make suggestions or possibly modify TAMA Collapse to allow the kind of input you are using.

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard,

So far I have mapped the Nanopore reads with deSALT, collapsed and merged the reads with TAMA, filtered out the best transcript describing each gene, and manually corrected the transcripts (I fixed misalignments; the aligner made many mistakes with short exons (<15 bp)). Now I have an annotation of acceptable quality. I then realigned the Nanopore reads with the help of this annotation (it helps a lot with the short exons) and collapsed and merged the reads with TAMA. My purpose is to identify all the relevant isoforms. However, I have a problem: deSALT sometimes does not take the annotation into account and misaligns the read splice sites (by one or more nucleotides), which leads to the creation of many artificial isoforms. I thought that if I corrected these errors in the alignment with FLAIR and used the corrected alignment with TAMA, I could avoid the creation of the artificial isoforms. What do you think, could this be a possible solution?

Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

Ok, I think I understand now, but please correct me if I am wrong.

So you made a reference annotation using Nanopore reads and with some manual curation. You are now wanting to use that annotation to assess all the Nanopore reads from different samples (sort of bootstrapping)? And you are using the reference annotation and FLAIR to get better Nanopore read mapping information? And you want to feed this into the TAMA pipeline?

Ok so if the answer was yes to all of those questions then I can suggest a different way of doing this which does not involve FLAIR.

You should be able to run TAMA merge to merge all the TAMA Collapse runs you did and the reference annotation you made. Give priority to your reference annotation for TSS, TES, and Splice Junctions and allow for a wide wobble of something like 30bp (you can tune this as necessary).

So the filelist.txt file would look something like this:

reference.bed capped 1,1,1 ref
sample1.bed no_cap 2,2,2 sample1
sample2.bed no_cap 2,2,2 sample2
...

The result will be a projection of all your runs onto your reference, with adoption of the reference SJ calls for the original mappings that are similar enough (within the wobble threshold) to the reference models. So SJ calls from deSALT that are within the wobble threshold of the reference annotation will be converted to the reference annotation's calls.
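As a rough illustration of that snapping idea (a sketch only, not TAMA's actual code; note that TAMA applies the correction per matched transcript model, not per individual junction):

```python
# Sketch of wobble-based snapping: a sample splice junction coordinate is
# replaced by the reference coordinate when it falls within the wobble
# threshold of a higher-priority reference call.
def snap_to_reference(sj, reference_sjs, wobble=30):
    for ref_sj in reference_sjs:
        if abs(sj - ref_sj) <= wobble:
            return ref_sj   # adopt the reference splice junction call
    return sj               # too far from any reference SJ: keep as-is

ref = [1811, 1934, 2221]
print(snap_to_reference(1808, ref))   # 1811 (within 30 bp of a reference SJ)
print(snap_to_reference(1700, ref))   # 1700 (no reference SJ within 30 bp)
```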

The gene and trans report files will give you info on all the mergings. This includes wobble for each exon start and end so you can see where these mis-mappings occurred.

Also, you can run TAMA Merge to pull the IDs from the reference using the "-s prefix" parameter, where prefix should match the reference prefix as defined in the filelist.txt file.

And if you want I can alter my "tama_format_id_filter.py" tool to filter all models that do not match the reference.

Is this what you wanted to do? If not, could you explain how what you want to do is different?

Thank you, Richard

HegedusB commented 3 years ago

Hi Richard,

Basically yes. This is a really elegant way to do it; I will try it for sure! One question: my reads are capped and I want to capture the alternative TSSs and TTSs. If I give priority to the reference annotation for TSS, TES, and splice junctions, is it possible to preserve that variability, or does the wobble size handle it?

And one last question: my annotation is in GTF format. If I convert it into a simple BED format, is that enough, or do you use a special one?

Thanks for the really quick and useful answers! Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

> My reads are capped and I want to capture the alternative TSSs and TTSs. If I give priority to the reference annotation for TSS, TES, and splice junctions, is it possible to preserve that variability, or does the wobble size handle it?

There are two ways you can do this. If you want the sample models to retain their TSS and TTS information, you can give the TSS and TTS priorities at the same level as the reference:

reference.bed capped 1,1,1 ref
sample1.bed capped 1,2,1 sample1
sample2.bed capped 1,2,1 sample2

Note: be sure to use the capped mode for each.

However, if you only care about seeing how much TSS and TTS variability there is, then you will get that in the trans report file in the form of the first exon start wobble and the last exon end wobble.

I think for most applications it's best to do it the second way but I am not sure exactly how you will use the information so you will need to decide.

Also, in the TAMA Merge parameters you can tune the TSS and TTS wobble using the "-a" and "-z" settings. So basically, if you want to maintain separate transcript models when their TSSs differ by more than 100bp, you can use "-a 100". Let me know if this is not clear and I can explain in more detail.

> And one last question: my annotation is in GTF format. If I convert it into a simple BED format, is that enough, or do you use a special one?

You should convert it into BED12 using one of the TAMA conversion tools: https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Formatting

If you send me a few lines from your GTF file, I can tell you which tool to use for conversion.

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard, thanks a lot! Here are a few lines from the annotation:

scaffold_1 prediction exon 1759 1811 . - . gene_id "mRNA_1033"; transcript_id "mRNA_1033"; annot_source "JGI_r2"
scaffold_1 prediction exon 1867 1934 . - . gene_id "mRNA_1033"; transcript_id "mRNA_1033"; annot_source "JGI_r2"
scaffold_1 prediction exon 1990 2221 . - . gene_id "mRNA_1033"; transcript_id "mRNA_1033"; annot_source "JGI_r2"
scaffold_1 prediction exon 2272 2423 . - . gene_id "mRNA_1033"; transcript_id "mRNA_1033"; annot_source "JGI_r2"

Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

Ok, assuming those are the first few lines of the GTF file, you should probably use this tool: tama_format_gtf_to_bed12_stringtie.py

You will lose the annot_source info though as that is a bit of a custom sub-field. If you need to retain that let me know and I can build it in.

Also, your gene and transcript IDs should be different (mRNA_1033 for both is not good), so hopefully there aren't further issues with the naming. You may run into errors during the conversion if there are problems with the ID naming. Make sure you get the same number of transcripts after the conversion, and let me know if you have issues with it.
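A quick pre-flight check along these lines can flag colliding IDs before running the converter (an illustrative Python sketch, not a TAMA tool):

```python
import re

# Sketch: flag GTF records whose gene_id equals their transcript_id,
# which can cause confusing IDs downstream of the conversion.
def find_id_collisions(gtf_lines):
    collisions = set()
    for line in gtf_lines:
        gene = re.search(r'gene_id "([^"]+)"', line)
        trans = re.search(r'transcript_id "([^"]+)"', line)
        if gene and trans and gene.group(1) == trans.group(1):
            collisions.add(gene.group(1))
    return sorted(collisions)

example = ['scaffold_1\tprediction\texon\t1759\t1811\t.\t-\t.\t'
           'gene_id "mRNA_1033"; transcript_id "mRNA_1033";']
print(find_id_collisions(example))   # ['mRNA_1033']
```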

What did you use to generate this GTF file? Did you do manual changes to it? This will give me an idea of what issues might pop up.

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard,

I made it in R with the rtracklayer package. The annot_source info is not so important; I will delete it anyway. I will rename the transcript and gene IDs as well. Thanks for the suggestions. I will let you know about the results. Thanks a lot again.

Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

You can still try running the converter without changing the gene and transcript IDs. Having the same ID for both should not cause issues with the conversion, but it may lead to confusion later on. I pointed it out because it may indicate other issues with the GTF formatting which may not work with the tool.

I think you can go ahead and just try running that GTF file through "tama_format_gtf_to_bed12_stringtie.py" and see if it works. If it doesn't work then we can see if there is any other formatting issue that needs to be changed.

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard, I have tried to run the converter script, but I got this error message:

Traceback (most recent call last):
  File "/SSD/software/tama/tama_go/format_converter/tama_format_gtf_to_bed12_stringtie.py", line 156, in <module>
    gene_trans_dict[gene_id][trans_id].finish_bed()
  File "/SSD/software/tama/tama_go/format_converter/tama_format_gtf_to_bed12_stringtie.py", line 77, in finish_bed
    self.t_start = self.start_list[0] - 1 # adjust for bed indexing
IndexError: list index out of range

Instead, I used the gtfToGenePred && genePredToBed steps and changed column 4 to gene_id;transcript_id. tama_merge accepted it as input and finished the run without any error. The command used was:

tama_merge.py -f annot_filelist_path -p ./tama_merge_out/Merge_result -a 50 -z 50

The annot_filelist looked like this:

copci_r3_annot_m.bed capped 1,1,1 ref
../sep_joint_mapping/tama_s_filtering_out//NB01.bed capped 1,2,1 NB01
../sep_joint_mapping/tama_s_filtering_out//NB02.bed capped 1,2,1 NB02
../sep_joint_mapping/tama_s_filtering_out//NB03.bed capped 1,2,1 NB03
../sep_joint_mapping/tama_s_filtering_out//NB04.bed capped 1,2,1 NB04
../sep_joint_mapping/tama_s_filtering_out//NB05.bed capped 1,2,1 NB05

However, when I compared the new result I could not find any changes. For example: [screenshot: wobbling_example_1]

GenomeRIK commented 3 years ago

Hi Botond,

For the gtf to bed conversion:

Sorry, I chose the wrong converter for your GTF file. You can try tama_format_gtf_to_bed12_ncbi.py. I just added the directions to the wiki, but the code was already in the repo, so you should have it.

For the TAMA Merge run:

Could you share a screen shot of the full gene?

Thank you, Richard

HegedusB commented 3 years ago

Hi Richard,

For the gtf to bed conversion: yes, the transcript numbers are the same.

For the TAMA Merge run:

[screenshot: full_gene]

Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

Are you using the TAMA Merge BED12 file for the genome browser? Or did you convert back to GTF before uploading to the genome browser? The transcript models should be colour coded if you used the BED12 file.

Which transcript IDs belong to the reference annotation?

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard,

Yes, it was converted from BED to GTF. I used your "tama_convert_bed_gtf_ensembl_no_cds.py" script for that.

Cheers, Botond

HegedusB commented 3 years ago

Another thing: when I use the -s parameter with tama_merge.py, I get this error message:

[screenshot of the error message]

Would it help if I sent you the annotation file for a check?

Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

Can you use the BED12 file for the genome browser and share the screen shot of the same gene? Also let me know which colour represents the reference.

As for the "-s" error, this is because you use "_" in your IDs, which throws off the parsing, since TAMA uses "_" to separate the source name from the transcript ID. I can fix this, but I want to see what is going on with the merging behaviour first. In general though, for TAMA tools an underscore in an ID is not good unless it came from TAMA. This just has to do with how things are parsed.
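To illustrate the parsing issue (a sketch of the assumed naming scheme, not TAMA's actual code):

```python
# TAMA builds merged IDs as "<source>_<transcript>", so tools may split on
# "_" to recover the source prefix. An underscore inside the user's own ID
# makes that split ambiguous. (Assumed scheme, for illustration only.)
merged_id = "ref_mRNA_1033"   # source "ref" + user-supplied ID "mRNA_1033"

# Splitting once still works here...
source, trans_id = merged_id.split("_", 1)
print(source, trans_id)        # ref mRNA_1033

# ...but splitting on every underscore misparses the ID:
print(merged_id.split("_"))    # ['ref', 'mRNA', '1033']

# Sanitizing user IDs before feeding them to TAMA avoids the ambiguity:
print("mRNA_1033".replace("_", "."))   # mRNA.1033
```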

I'll see if I can fix the "-s" underscore issue and let you know when the update is on the repo. But in the meanwhile could you post the genome browser view with the BED12 file as input?

Cheers, Richard

GenomeRIK commented 3 years ago

Hi Botond,

I forgot to answer your other question/offer. If you can send me the annotation file that would be very helpful.

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard, can you send me an email address where I can send it? Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

Please see the email address info on the main page of this repo.

Thank you, Richard

HegedusB commented 3 years ago

Hi Richard, I am sorry, but I cannot find it. I apologize for that. Could you copy it here? Cheers, Botond

GenomeRIK commented 3 years ago

GenomeRIK at Gmail dot com

GenomeRIK commented 3 years ago

Hi Botond,

Thank you for sending the annotation file.

So there are 2 things that are at play here.

  1. You are using an older version of TAMA, I think from before I made changes to the underscore parsing. I suggest you git clone the repo again.

  2. Your reference annotation has only 1 transcript model per gene. So my original suggestion will not work as intended. The original method only works if all the exon chains are represented in the reference annotation since it will only correct the splice junctions to the reference if the entire transcript model has a match. So if there is a transcript model with fewer or more exons as compared to the reference then it won't match and will not be converted to the reference splice junctions. This is to allow for analysis of more tricky situations where there might actually be an alternative splice site for a novel transcript model.
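A minimal sketch of that matching rule as described here (illustrative Python, not TAMA's implementation):

```python
# Sketch: a sample model only adopts the reference splice junctions when the
# whole exon chain matches, i.e. the same number of junctions and each one
# within the wobble threshold of its reference counterpart.
def chains_match(sample_sjs, ref_sjs, wobble=30):
    if len(sample_sjs) != len(ref_sjs):
        return False   # fewer or more exons: no match, no correction
    return all(abs(s - r) <= wobble for s, r in zip(sample_sjs, ref_sjs))

ref = [1811, 1867, 1934, 1990]
print(chains_match([1809, 1867, 1934, 1992], ref))   # True
print(chains_match([1809, 1867, 1934], ref))         # False (missing a junction)
```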

So actually your original intentions were probably the right way to go.

So here is my suggestion now (however this is me assuming some things about what you have and what you are doing):

Use the BED12 files you got from running FLAIR and merge them with TAMA merge using the same settings you were using for my last suggested pipeline.

So the same filelist.txt structure using the reference annotation but with each bed file being from the FLAIR BED12 output.

I think this will give you what you want but again this is based on my assumptions of your project which may not be accurate.

Also when you view the TAMA merge output, I suggest always viewing the BED12 file format as it will give you more information when viewing.

Let me know if you still have issues with TAMA Merge "-s" after you try the latest version of TAMA.

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard, I see. Thanks for the explanation. The FLAIR output basically contains all the mapped reads, almost 5M per sample. Do you think TAMA Merge can handle that without a prior collapse step? Can I somehow extract the read counts as was possible with the collapse -> filter -> merge workflow? Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

TAMA Merge is limited by the available compute resources, so I cannot say whether it will be able to handle all the FLAIR files. However, you can run it individually on the FLAIR files first and then merge them all together as you did above. This produces the same result but lets you run everything with less memory.

To run individually, you just put the single file in the filelist.txt file and run TAMA Merge as you usually would. This will act like TAMA Collapse, but without the variant calling and the other sequence-based analyses.
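For example (the file and source names here are placeholders), the filelist.txt for such a single-sample, collapse-like run could contain just one line:

```
sample1_flair.bed capped 1,1,1 sample1
```

followed by the usual invocation, e.g. "python tama_merge.py -f filelist.txt -p sample1_merged" (the -f and -p parameters as in the commands shown elsewhere in this thread).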

You can still get the read counts by using the "tama_read_support_levels.py" tool, which allows you to carry read count information through the different TAMA modules. Since you have a BED line for each read, TAMA will treat each BED12 line as a read, if that makes sense.

So let's say you run TAMA merge on the BED file for a single sample. You would then want to run "tama_read_support_levels.py" with the following filelist_readsupport.txt file (don't mix this up with the TAMA Merge filelist.txt file):

sample1 merge.txt trans_read

And then run the tool like: python tama_read_support_levels.py -f filelist_readsupport.txt -o sample1_merge -m no_merge

That will generate a read support file and you will then use read support files for all downstream runs of "tama_read_support_levels.py" to keep read count information as you tack on other modules.

Note: You need to change the filelist_readsupport.txt file and parameters for "tama_read_support_levels.py" depending on the level of run. https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Read-Support

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard,

Ok! Now it is clearer. The FLAIR BED output differs a little from what TAMA Collapse produces. FLAIR output:

scaffold_1 1423 2023 67:553|1f202b78-ee34-44f4-ba8c-73d7edabbef5;16 60 - 1423 2023 99,99,99 3 388,68,34, 0,443,566,
scaffold_1 1426 3414 66:1618|eab3ca4a-40c9-45d4-bda4-019f0b265ff4;16 60 - 1426 3414 99,99,99 9 385,68,232,152,311,186,112,30,75, 0,440,563,845,1053,1414,1657,1821,1913,

After I modified column 4 to this:

scaffold_1 1423 2023 read_1;1f202b78-ee34-44f4-ba8c-73d7edabbef5_1 60 - 1423 2023 99,99,99 3 388,68,34, 0,443,566,
scaffold_1 1426 3414 read_2;eab3ca4a-40c9-45d4-bda4-019f0b265ff4_2 60 - 1426 3414 99,99,99 9 385,68,232,152,311,186,112,30,75, 0,440,563,845,1053,1414,1657,1821,1913,

TAMA Merge started to run. At first sight, the produced BED file represents the FLAIR-corrected alignment. The program is still running, so I am not sure what the result will look like. I will let you know when it is finished.
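The column 4 rewrite described above can be scripted; here is a sketch (the "read_<n>;<uuid>_<n>" naming simply mirrors the example above and is only one possible scheme):

```python
# Sketch: rewrite FLAIR BED column 4 into TAMA-friendly names.
# Assumes FLAIR names look like "67:553|<uuid>;16" as in the lines above.
def rename_flair_bed(lines):
    out = []
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # strip the "67:553|" prefix and the ";16" strand/flag suffix
        uuid = fields[3].split("|")[1].split(";")[0]
        fields[3] = "read_%d;%s_%d" % (n, uuid, n)
        out.append("\t".join(fields))
    return out

bed = ["scaffold_1\t1423\t2023\t67:553|1f202b78-ee34-44f4-ba8c-73d7edabbef5;16\t60\t-"]
print(rename_flair_bed(bed)[0].split("\t")[3])
# read_1;1f202b78-ee34-44f4-ba8c-73d7edabbef5_1
```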

Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

I was just about to say that you would need to change the IDs, but it looks like you already knew that.

This is an interesting use of TAMA so I am excited to see if this works out for you.

Cheers, Richard

GenomeRIK commented 3 years ago

Hi Botond,

It has been a while so I am closing this issue. Please re-open if the issue has not been resolved.

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard,

I am sorry for being silent for so long. The merge took more than a week to finish. What I did: I merged the FLAIR output separately with TAMA Merge, then counted read support, filtered out the transcripts with low read support (read support == 1), merged the filtered BED files with TAMA, converted to GTF, and analysed with SQANTI3. With this workflow I could assign 10 percent more transcripts to my reference annotation. But I have run into some problems. When TAMA Merge merges the FLAIR output, it produces transcripts where some exon end positions have a lower value than the start positions; I do not know why this happens. Furthermore, I cannot figure out how to count read support for the filtered, merged FLAIR output and for the last merge output (when I merge the five filtered BED files). Additionally, the second merge step only worked when I used the "-d merge_dup" option. Do you have any suggestions on how to do this workflow correctly?

Thanks for your help! Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

> The merge took more than a week to finish.

Did TAMA Merge take one week to run, or was that the whole pipeline? Just curious about run time issues for TAMA Merge.

> But I have run into some problems. When TAMA Merge merges the FLAIR output, it produces transcripts where some exon end positions have a lower value than the start positions. I do not know why this happens.

Can you send examples of the situations where TAMA Merge shows exon end positions with a lower value than the start positions? I would like to see the BED lines from TAMA Merge and the corresponding BED lines from the FLAIR BED files.

> Furthermore, I cannot figure out how to count read support for the filtered, merged FLAIR output and for the last merge output (when I merge the five filtered BED files).

Are you referring to how to keep all the read support information from the very beginning through multiple TAMA Merges?

> Additionally, the second merge step only worked when I used the "-d merge_dup" option. Do you have any suggestions on how to do this workflow correctly?

Ok, when you need the "-d merge_dup" option, it means there is a tricky merge situation where there are multiple paths to the same model. This is a bit complicated to explain, but it can indicate a problematic region in your input models. I suspect this may be due to a behaviour of FLAIR, but I think it will be clearer once I get the examples of the exon start and end issues from you.

Cheers, Richard

HegedusB commented 3 years ago

Hi Richard,

I am sorry for the long silence. I am trying to collect the requested data. TAMA Merge on the FLAIR-corrected alignments took longer than TAMA Collapse on the deSALT alignments (the 5 samples run one after another took more than a week), but it used less memory. The main reason is the uneven distribution of the aligned reads: most of the reads align to a small number of genes. These genes can stall both TAMA Merge and TAMA Collapse, but TAMA Merge more so.

Cheers, Botond

HegedusB commented 3 years ago

Hi Richard,

You were right: the exon start/end position problem was related to FLAIR. After resolving it, TAMA worked properly. My only remaining problem is getting the read support for the merged samples. My workflow: 5 samples -> correct with FLAIR (separately) -> merge with TAMA (separately) -> count read support (separately) -> filter out singletons (separately) -> merge the 5 filtered samples together -> ? read support -> SQANTI3. Which file should I use after the second merge step for the read support calculation? I am a little confused here. With the usual collapse -> merge workflow it is clear, but now it is hard to figure out. Thanks for your help.

Cheers, Botond

GenomeRIK commented 3 years ago

Hi Botond,

Sorry for not responding to this. Things got a bit busy for me. I am going to close this now but if you still need help with this just start a new thread.

Thank you, Richard