SQANTI/Pigeon classification on merged files

sum732 commented 1 year ago

Hello,

Thanks for creating TAMA and also creating a great wiki to understand the I/O of TAMA. It helps so much.

I have 4 long read samples, that I processed via Pigeon and got classifications (FSM etc). Next, I used TAMA to create a non-redundant set and its all worked. However, I am not sure to take this NR isoform set and classify again with Pigeon. The requirements are to have gff format. I can convert a bed->GFF that information is not conserved.

How can take this out from TAMA

chr1    14401   29350   G1.1;pigeon_1451_PB.1.1 40      -       14401   29350   255,0,0 9       428,69,152,159,198,510,147,6624,30      0,568,1394,2205,2456,2831,3513,3866,14919
chr1    14401   29359   G1.2;pigeon_1451_PB.1.2 40      -       14401   29359   255,0,0 10      428,152,159,198,136,137,147,102,5979,39 0,1394,2205,2456,2831,3204,3513,3866,4511,14919
chr1    14406   20544   G1.3;pigeon_1451_PB.1.3 40      -       14406   20544   255,0,0 8       423,69,152,159,202,510,147,2277 0,563,1389,2200,2447,2826,3508,3861
chr1    14406   29346   G1.4;pigeon_0456_PB.1.2 40      -       14406   29346   255,0,0 11      423,69,152,159,198,136,137,147,99,5979,26       0,563,1389,2200,2451,2826,3199,3508,3861,4506,14914
chr1    14406   29346   G1.5;pigeon_0456_PB.1.1 40      -       14406   29346   255,0,0 8       1904,159,180,510,147,99,5979,26 0,2200,2469,2826,3508,3861,4506,14914
chr1    21981   29345   G1.6;pigeon_0456_PB.1.3 40      -       21981   29345   255,0,0 2       2910,25 0,7339

to something like

chr1    PacBio  transcript      14407   29345   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    14407   14829   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    14970   15038   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    15796   15947   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    16607   16765   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    16854   17055   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    17233   17751   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    17915   18061   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    18268   24891   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";
chr1    PacBio  exon    29321   29345   .       -       .       gene_id "PB.1"; transcript_id "PB.1.1";

Any help will be greatly appreciated.

Many Thanks SM

sum732 commented 1 year ago

Ha, I think answer is here somewhere https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Formatting

sum732 commented 1 year ago

Ahh, it did not worked. I used tama_convert_bed_gtf_ensembl_no_cds.py. Not sure if its GFF format by Pigeon or something else. I am getting terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::substr: __pos from Pigeon. Not sure if you have ideas...

GenomeRIK commented 1 year ago

Hello SM,

Thank you for using TAMA!

Could you link me to the Pigeon site? I suspect it is built to expect a very specific formatting for the transcript and gene ID's. If this is the case you could either hack the ID info from TAMA or you could ask the Pigeon developers to widen the formatting allowances.

If I can have a look at their code, I can figure out what is going on.

Thank you, Richard

sum732 commented 1 year ago

Hi Richard,

Thanks for replying, much appreciated. The official website is here: https://isoseq.how/classification/pigeon.html

As a side note, I am wondering if I can pick your brain on collapse of TAMA vs gffcompare merge. Its seems I am getting different results. Now I do expect some difference, but the numbers are in thousands and that is bit puzzling.

My Best, SM

GenomeRIK commented 1 year ago

Hi SM,

Ok yeah sorry I didn't realized Pigeon was PacBio's code. So it's basically their own flavour of SQANTI. I can't seem to find the code so if you have a link to the code that would be helpful. However, I suspect they built it specifically for their ID naming conventions. I also suspect they did this on purpose. So you could ask them to loosen their ID acceptance but I am not sure they will.

I think it is worth a shot to ask them.

However, if that does not work you could use SQANTI instead which should work with TAMA outputs. Or you could hack the TAMA ID's to look like PacBio's ID's. So this would involve making a basic script to parse the PacBio ID from the TAMA output and reform the original PacBio ID format. For example turning this "G1.1;pigeon_1451_PB.1.1" into this "PB.1;PB.1.1". If you then ran this modified BED file through the TAMA bed to gtf convertor it should work.

Is there a specific reason you are wanting to use Pigeon instead of SQANTI?

Regarding the comparison between TAMA and gffcompare merge, do you mean TAMA Merge? Or do you mean TAMA Collapse?

TAMA Merge is the comparative tool to gffcompare merge. So if you are referring to that, then it comes down to how each tool chooses to accept that two different models should be merged. TAMA Merge is designed to be tuned using the parameter settings so you can make it act like gffcompare merge if you wanted to but you can also make it behave differently.

Also I am not sure if you are running TAMA Collapse and Merge with default parameters but if you are then it is not a good idea to use the default parameters. TAMA was designed to be tuned to your specific type of data with respect to species and sequencing characteristics. The default parameters are only there as a starting point for playing around with the tool but they are not designed for any specific species or sequencing characteristics.

The default parameters are also relatively very stringent on merging models which means you tend to get a lot more models in the output.

If you would like to understand the parameters more, I suggest reading the TAMA paper: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07123-7

Thank you, Richard

sum732 commented 1 year ago

Hi Richard,

Thanks for promptly responding.

I will dig more into this. Indeed, will try to see if accepted ID definition can be loosened.

Using Pigeon as now it is official Iso-Seq software suite. Some of cupcake scripts are now depreciated. So just keeping up the changes and best possible supports.

Thanks for the pointers on parameters. I am using TAMA Merge. I has used Pigeon to collapse. Used TAMA merge to make a union. Took the same collapsed GTFs from Pigeon and now used gffcompare to merge. There is 'capping' parameter in TAMA and I said trust all transcripts from all the PacBio samples. Do not see any specific parameters in gffCompare. I will need read more on the parameters to figure things out.

My Best, SM

GenomeRIK commented 1 year ago

Hi SM,

The official Iso-Seq software is somewhat limited. I think you should read the TAMA paper to understand some of the issues.

I suggest using TAMA Collapse and TAMA Merge and SQANTI. But it's up to you.

There are also other good academia developed software for collapsing:

You can find some of them here: https://long-read-tools.org

Also I am not sure why you are using gffcompare?

Thank you, Richard

sum732 commented 1 year ago

Thanks Richard, I will for sure read the TAMA paper and thanks for links. I have short read data as well, and want(try) to take away any bias due to the tools. I can compare transcripts and its classification for the identical sample from long and short read. For short read, its StringTie to assemble and it produces GTF, that I can classify with for example Pigeon. Then I have gffcomapre to make a NR set for union that I can further compare with the long read data. BW, SM

GenomeRIK commented 1 year ago

Hi SM,

I am going to close this thread now but feel free to reopen it if you are still having issues with this. Or you can open a new one if it's something different.

Thank you, Richard

sum732 commented 1 year ago

Thanks a lot Richard. Will do as suggested

GenomeRIK / tama

SQANTI/Pigeon classification on merged files #98