bcgsc / ChopStitch

Finding putative exons and constructing splicegraphs using Trans-ABySS contigs
http://www.bcgsc.ca/platform/bioinfo/software/chopstitch
GNU General Public License v3.0
12 stars 1 forks source link

Option to output transcript to gene mapping? #4

Closed kmnip closed 5 years ago

kmnip commented 5 years ago

Would it be possible to have a new option in ChopStitch to print transcript to gene mapping?

That is, each splice graph would be assigned a gene id and the tab-delimited file would list the mapping like so:

T1<tab>G1
T2<tab>G1
T3<tab>G1
T4<tab>G2
T5<tab>G2
...
hamzakhanvit commented 5 years ago

Hi Ka Ming,

That would be nice. I'll work on this.

Hamza

KristinaGagalova commented 5 years ago

Hi Hamza, I will be also very interesting to use this option! Let me know if you need any help

hamzakhanvit commented 5 years ago

Hi Kristina and Ka Ming,

I have pushed a new script (FindSubcomponents.py) that will take the output of MakeSplicegraph.py (ie, the DOT file) and generate a csv file by default. This csv file will have two columns - 1) GeneID - Numbered splice graphs 2) ExonID - Exons that belong to a specific geneID

There is another option --writesplicesubgraphs which will write a DOT file of splicegraph's subcomponents. You don't need to use ccomps now as this new script uses the NetworkX library to find connected components. I hope this works. Let me know if you have any other questions or suggestions.

Cheers, Hamza

kmnip commented 5 years ago

Thank you so much Hamza! We will test it out this Friday! :)

kmnip commented 5 years ago

Hi Hamza,

Your Python script output exon IDs. Can you output transcript ID (instead of exon ID), which was our original request?

Ka Ming

kmnip commented 5 years ago

A bit of background for you... The reason for this request was meant to be used for RNA-seq quantification tools like Salmon, ie.

  -g [ --geneMap ] arg                  File containing a mapping of 
                                        transcripts to genes.  If this file is 
                                        provided Salmon will output both 
                                        quant.sf and quant.genes.sf files, 
                                        where the latter contains aggregated 
                                        gene-level abundance estimates.  The 
                                        transcript to gene mapping should be 
                                        provided as either a GTF file, or a in 
                                        a simple tab-delimited format where 
                                        each line contains the name of a 
                                        transcript and the gene to which it 
                                        belongs separated by a tab.  The 
                                        extension of the file is used to 
                                        determine how the file should be 
                                        parsed.  Files ending in '.gtf', '.gff'
                                        or '.gff3' are assumed to be in GTF 
                                        format; files with any other extension 
                                        are assumed to be in the simple format.
                                        In GTF / GFF format, the 
                                        "transcript_id" is assumed to contain 
                                        the transcript identifier and the 
                                        "gene_id" is assumed to contain the 
                                        corresponding gene identifier.
hamzakhanvit commented 5 years ago

Hi Ka Ming,

Thanks for the clarification. ChopStitch just uses the prefix of the transcript name, ie, if the transcript ID is k25S217353 883 5678 59107+,...,165824- in TransABySS output, Chopstitch will use the word just before the first space, which in this case will be k25S217353. Will it work if I output something as under or would you like the full transcript name in the gene map TSV file?

TranscriptID GeneID k25S217353 1 k25S222919 2 k35S134377 2 k25S222982 2 k35S131172 3 k25J217281 3 k25J218389 4 k35S130112 4 k25S217752 5

kmnip commented 5 years ago

Hi Hamza,

Yes, using just the word before the first space would be perfect!

Ka Ming

hamzakhanvit commented 5 years ago

Just added a --geneMap option. Let me know if that works. :)

kmnip commented 5 years ago

Thanks Hamza! :)

I am using the Ensembl reference transcripts as a test now. The networkx readDot module is giving me warnings about the node names, eg.

Warning: syntax ambiguity - badly delimited number '.1_' in line 779798 of input splits into two tokens

It doesn't like the . in the node names, eg.

ENST00000644329.1_543_661->ENST00000644329.1_662_708;

If you wrap the node names in your output dot files from MakeSplicegraph.py with double-quotes, then there are no more warnings, ie.

"ENST00000644329.1_543_661"->"ENST00000644329.1_662_708";
hamzakhanvit commented 5 years ago

Thanks. I'll update it.