WGLab / lncScore

A python package for the identification of lncRNA from the assembled novel transcripts

Documentation #9

Open ofonov opened 7 years ago

ofonov commented 7 years ago

Hi,

Could you please clarify the lncScore documentation? I am new to the lncRNA prediction field, and some things were not clear to me. In particular:

-f input files, --file=input files, enter transcripts in .bed or .fasta
Does that mean the input here has to be a BED or FASTA file of the assembled transcriptome, or is it something else?

-g gtf file name, --gtf=gtf file name please enter your gtf files
Do you mean that annotation files should go here, e.g. Homo_sapiens.GRCh38.84.gtf?

-x hexamer matrix, --hex=hexamer matrix
Does it have to be built separately for each genome version, e.g. GRCh38 or GRCh37, or is it universal?

-t training dataset, --train=training dataset
The same question as above: does it have to be built separately for each genome version, e.g. GRCh38 or GRCh37, or is it universal?

Thank you in advance for the clarification. It would be really useful for me, and perhaps for other users of your software. I should say up front that I did read the paper but did not find answers to these questions there. Perhaps I missed something.

frankMusacchia commented 7 years ago

Hi there,

I have the same question. I could not work out whether the tool can be run with just the FASTA transcriptome and the ORFs as input, or whether it needs something else. Can you please clarify?

Thanks!

zhaodoctor commented 7 years ago

In addition to the .fasta file, you should also provide the .gtf file. The .gtf file is used to extract the exon information for each transcript in the .fasta file, so it should contain the exon information for all transcripts in your .fasta file. For the format, you can use the .gtf files of the GENCODE dataset as a reference. The ORFs are calculated inside the lncScore program, so you do not need to provide them. The hexamer matrix and training dataset are built from the sequences, so they have nothing to do with the genome version.
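For anyone preparing these two inputs, a quick consistency check before running the tool may help. The following is only a minimal sketch, not part of lncScore: the helper names are hypothetical, and it assumes that the first token of each FASTA header is the same ID that appears as `transcript_id` in the exon lines of the .gtf file. Whether lncScore matches IDs exactly this way is not confirmed in this thread.

```python
import re

# Matches the transcript_id attribute in a GTF attribute column.
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def fasta_ids(fasta_text):
    """Collect transcript IDs from FASTA header lines."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                # Assumption: the ID is the first token before whitespace or '|'.
                ids.add(re.split(r"[\s|]", header, maxsplit=1)[0])
    return ids


def gtf_exon_transcript_ids(gtf_text):
    """Collect transcript IDs that have at least one 'exon' line in the GTF."""
    ids = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features carry the information needed here
        match = _TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def missing_exon_info(fasta_text, gtf_text):
    """Return FASTA transcript IDs that have no exon line in the GTF."""
    return sorted(fasta_ids(fasta_text) - gtf_exon_transcript_ids(gtf_text))


# Small made-up example: tx2 is in the FASTA but has no exon line.
example_fasta = ">tx1 some description\nACGT\n>tx2\nGGCC\n"
example_gtf = (
    'chr1\tassembly\texon\t100\t200\t.\t+\t.\t'
    'gene_id "g1"; transcript_id "tx1";\n'
)
print(missing_exon_info(example_fasta, example_gtf))
```

For real files, read the contents with `open(path).read()` and pass them in; an empty result means every transcript in the .fasta file has at least one exon line in the .gtf file.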

frankMusacchia commented 7 years ago

Ok. Thanks. Could you please suggest a program to convert a fasta into GTF format?

About the training dataset: the "dat" folder contains the training set and hexamer files only for human and mouse; for other species it contains only the XX_hexamer.tsv. How can I produce the training set?

zhaodoctor commented 7 years ago

I don't think a FASTA file alone can be converted into GTF format, because a FASTA file does not contain any exon information, and I don't know of any program that does this. If you want to produce your own training model, you can use 'make_TrainingDat.py' in the 'tools' folder. You can also use 'make_hexamer_tab.py' in the 'tools' folder to produce your own xx_hexamer.tsv.

frankMusacchia commented 7 years ago

How do I produce the GTF file if I only have a newly assembled de novo transcriptome? Do I need a reference genome to locate the exons exactly for lncScore?

zhaodoctor commented 7 years ago

Sorry, I have no idea how to produce the GTF file from a newly de novo assembled transcriptome, as I have not done any de novo assembly for a non-model organism. When you use lncScore, you have to provide the exon information. If you have the exact exon information, you can produce a GTF file yourself, following the format of the GTF files in the GENCODE dataset.
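As an illustration of writing such a file by hand, here is a minimal sketch that turns known exon coordinates into GTF exon lines. The function name and the example values are hypothetical. It assumes 1-based inclusive coordinates and the standard nine tab-separated GTF columns with `gene_id` and `transcript_id` attributes; which attributes lncScore actually reads is not stated in this thread, and the exon numbering order used for the minus strand is also an assumption.

```python
def exon_gtf_lines(transcript_id, gene_id, chrom, strand, exons, source="custom"):
    """Format exon coordinates as GTF 'exon' lines.

    exons is a list of (start, end) pairs, 1-based and inclusive.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Assumption: exons are numbered 5' to 3' along the transcript,
    # so minus-strand exons are listed from the highest coordinate down.
    ordered = sorted(exons, reverse=(strand == "-"))
    lines = []
    for number, (start, end) in enumerate(ordered, start=1):
        if start < 1 or end < start:
            raise ValueError(f"invalid exon coordinates: {start}-{end}")
        attributes = (
            f'gene_id "{gene_id}"; transcript_id "{transcript_id}"; '
            f"exon_number {number};"
        )
        # Columns: seqname, source, feature, start, end, score, strand, frame, attributes
        lines.append(
            "\t".join(
                [chrom, source, "exon", str(start), str(end), ".", strand, ".", attributes]
            )
        )
    return lines


# Made-up transcript with two exons on the plus strand.
for gtf_line in exon_gtf_lines("tx1", "g1", "chr1", "+", [(300, 400), (100, 200)]):
    print(gtf_line)
```

Writing the returned lines to a file, one per line, gives a GTF with one exon record per exon. This only helps when the exon coordinates are already known, which is exactly what a FASTA-only de novo assembly does not provide.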

frankMusacchia commented 7 years ago

Ok, thanks a lot for your quick responses.