Closed kmnip closed 5 years ago
Hi Ka Ming,
That would be nice. I'll work on this.
Hamza
Hi Hamza, I will be also very interesting to use this option! Let me know if you need any help
Hi Kristina and Ka Ming,
I have pushed a new script (FindSubcomponents.py) that will take the output of MakeSplicegraph.py (ie, the DOT file) and generate a csv file by default. This csv file will have two columns - 1) GeneID - Numbered splice graphs 2) ExonID - Exons that belong to a specific geneID
There is another option --writesplicesubgraphs which will write a DOT file of splicegraph's subcomponents. You don't need to use ccomps now as this new script uses the NetworkX library to find connected components. I hope this works. Let me know if you have any other questions or suggestions.
Cheers, Hamza
Thank you so much Hamza! We will test it out this Friday! :)
Hi Hamza,
Your Python script output exon IDs. Can you output transcript ID (instead of exon ID), which was our original request?
Ka Ming
A bit of background for you... The reason for this request was meant to be used for RNA-seq quantification tools like Salmon, ie.
-g [ --geneMap ] arg File containing a mapping of
transcripts to genes. If this file is
provided Salmon will output both
quant.sf and quant.genes.sf files,
where the latter contains aggregated
gene-level abundance estimates. The
transcript to gene mapping should be
provided as either a GTF file, or a in
a simple tab-delimited format where
each line contains the name of a
transcript and the gene to which it
belongs separated by a tab. The
extension of the file is used to
determine how the file should be
parsed. Files ending in '.gtf', '.gff'
or '.gff3' are assumed to be in GTF
format; files with any other extension
are assumed to be in the simple format.
In GTF / GFF format, the
"transcript_id" is assumed to contain
the transcript identifier and the
"gene_id" is assumed to contain the
corresponding gene identifier.
Hi Ka Ming,
Thanks for the clarification. ChopStitch just uses the prefix of the transcript name, ie, if the transcript ID is k25S217353 883 5678 59107+,...,165824- in TransABySS output, Chopstitch will use the word just before the first space, which in this case will be k25S217353. Will it work if I output something as under or would you like the full transcript name in the gene map TSV file?
TranscriptID GeneID k25S217353 1 k25S222919 2 k35S134377 2 k25S222982 2 k35S131172 3 k25J217281 3 k25J218389 4 k35S130112 4 k25S217752 5
Hi Hamza,
Yes, using just the word before the first space would be perfect!
Ka Ming
Just added a --geneMap option. Let me know if that works. :)
Thanks Hamza! :)
I am using the Ensembl reference transcripts as a test now. The networkx readDot module is giving me warnings about the node names, eg.
Warning: syntax ambiguity - badly delimited number '.1_' in line 779798 of input splits into two tokens
It doesn't like the .
in the node names, eg.
ENST00000644329.1_543_661->ENST00000644329.1_662_708;
If you wrap the node names in your output dot files from MakeSplicegraph.py
with double-quotes, then there are no more warnings, ie.
"ENST00000644329.1_543_661"->"ENST00000644329.1_662_708";
Thanks. I'll update it.
Would it be possible to have a new option in ChopStitch to print transcript to gene mapping?
That is, each splice graph would be assigned a gene id and the tab-delimited file would list the mapping like so: