Problem generating training dataset

WGLab / lncScore

A python package for the identification of lncRNA from the assembled novel transcripts


Problem generating training dataset #6

Open friederhadlich opened 7 years ago

friederhadlich commented 7 years ago

Hi,

I tried to prepare the training dataset for lncSCORE.py using this command:

    python tools/make_TrainingDat.py -m mrnas.fa -g merged-unknowns.gtf -l lincrnas.fa --hex cow-hexamer.tsv -p 10 -o training-out

    Traceback (most recent call last):
      File "tools/make_TrainingDat.py", line 481, in <module>
        inMatrix = open(MatrixPath)
    IOError: [Errno 2] No such file or directory: '/media/disk3/lncScore/tools/dat/Matrix'

Rerunning the Python script from the parent folder produces this output:

    Traceback (most recent call last):
      File "/PATH/lncScore/make_TrainingDat.py", line 524, in <module>
        mainProcess(ARRAY,outPutFileName,1,coding,noncoding,Alphabet,Matrix_hash,mRNA_num)
      File "/PATH/lncScore/make_TrainingDat.py", line 352, in mainProcess
        Max_Mscore_exon = max(Exons_mscore)
    ValueError: max() arg is an empty sequence

Do you have any idea how to proceed?

Thanks in advance, Frieder

zhaodoctor commented 7 years ago

The problem is that the path to the 'Matrix' file is wrong. Copy the 'dat' folder from '/media/disk3/lncScore/' into '/media/disk3/lncScore/tools/'; the 'Matrix' file will then be found and the problem will be solved.
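The copy step described above could be sketched in Python roughly as follows. This is only an illustration: `ensure_dat_folder` is a hypothetical helper that is not part of lncScore, and it assumes the layout mentioned in this thread (a 'dat' folder containing 'Matrix' at the repository root, next to a 'tools' folder).

```python
import os
import shutil


def ensure_dat_folder(repo_root):
    """Make sure <repo_root>/tools/dat/Matrix exists, copying it if needed.

    Hypothetical helper; the folder layout is assumed from this thread.
    Returns the path of the Matrix file under tools/dat.
    """
    src = os.path.join(repo_root, "dat")
    dst = os.path.join(repo_root, "tools", "dat")
    matrix = os.path.join(dst, "Matrix")

    if os.path.isfile(matrix):
        return matrix  # nothing to do

    if not os.path.isfile(os.path.join(src, "Matrix")):
        raise IOError("No Matrix file found in %s" % src)

    if os.path.isdir(dst):
        # tools/dat already exists but lacks Matrix: copy just that file.
        shutil.copy2(os.path.join(src, "Matrix"), matrix)
    else:
        # Copy the whole dat folder, as suggested in the reply above.
        shutil.copytree(src, dst)
    return matrix
```

Calling `ensure_dat_folder("/media/disk3/lncScore")` before running the script would correspond to the manual copy.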

friederhadlich commented 7 years ago

I copied the 'dat' folder as explained, but I still get this error message (the same one already mentioned above):

    python lncScore/tools/make_TrainingDat.py -m mrnas.fa -g merged-unknowns.gtf -l ../lincrnas.fa --hex cow-hexamer.tsv -p 1 -o training-out

    Traceback (most recent call last):
      File "lncScore/tools/make_TrainingDat.py", line 524, in <module>
        mainProcess(ARRAY,outPutFileName,1,coding,noncoding,Alphabet,Matrix_hash,mRNA_num)
      File "lncScore/tools/make_TrainingDat.py", line 352, in mainProcess
        Max_Mscore_exon = max(Exons_mscore)
    ValueError: max() arg is an empty sequence

Please let me know what to do ...

zhaodoctor commented 7 years ago

The error occurred during the calculation of the exon features, so I wonder whether some transcripts in mrnas.fa or lincrnas.fa have no corresponding exon information in merged-unknowns.gtf. You can check the 'inputfile.fasta' file: in each label line (beginning with '>') there should be some numbers (the exon lengths) after the transcript id.
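The header check suggested above could be automated with a few lines of Python. This is a minimal sketch under assumptions: `headers_without_exon_lengths` is a hypothetical name, and the header layout (transcript id followed by whitespace-separated integer exon lengths) is inferred from the description in this reply, not from the lncScore source. A transcript reported here would have no exon lengths, which is consistent with `max()` being called on an empty list.

```python
def headers_without_exon_lengths(fasta_path):
    """Return transcript ids whose '>' line has no integer after the id.

    Assumed header layout: '>transcript_id len1 len2 ...'.
    """
    missing = []
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue  # sequence line, not a header
            fields = line[1:].split()
            if not fields:
                continue  # empty header, nothing to report
            # Any purely numeric field after the id counts as an exon length.
            if not any(field.isdigit() for field in fields[1:]):
                missing.append(fields[0])
    return missing
```

An empty result would suggest that every transcript carries exon lengths; a non-empty one lists the transcripts worth checking against the GTF file.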

zhaodoctor commented 7 years ago

Please re-download exon_extraction.pl from the 'cpmodule' folder. I have edited this script so that it now deletes those transcripts whose exon information cannot be found in the GTF file.
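To see in advance which transcripts would be affected, one could compare the FASTA ids with the exon records in the GTF file. The following is a rough sketch, not a port of exon_extraction.pl: both function names are hypothetical, and it assumes that the first word of each FASTA header matches the `transcript_id` attribute in the GTF file exactly.

```python
import re
from collections import defaultdict


def exon_lengths_by_transcript(gtf_path):
    """Map each transcript_id to the list of its exon lengths."""
    lengths = defaultdict(list)
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # comment or header line
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 9 or cols[2] != "exon":
                continue  # only exon features are of interest
            match = re.search(r'transcript_id "([^"]+)"', cols[8])
            if match:
                # GTF coordinates are 1-based and inclusive.
                lengths[match.group(1)].append(int(cols[4]) - int(cols[3]) + 1)
    return dict(lengths)


def fasta_ids_without_exons(fasta_path, gtf_path):
    """Return FASTA ids that have no exon record in the GTF file."""
    known = exon_lengths_by_transcript(gtf_path)
    missing = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields and fields[0] not in known:
                    missing.append(fields[0])
    return missing
```

For example, `fasta_ids_without_exons("lincrnas.fa", "merged-unknowns.gtf")` would list the lincRNA entries that have no matching exon lines.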

friederhadlich commented 7 years ago

Hi Zhao,

Maybe I really am using incorrect input data.

MRNA-FILE: To generate the mRNA FASTA file, I filtered the NCBI cow RNA GTF file for protein-coding entries. The output was converted to FASTA format using the cattle reference genome and gffread. In the resulting FASTA file, the exon length information is missing from the header lines.

LNCRNA-FILE: Exon information is also missing in lincrnas.fa, because this file was downloaded directly from the NONCODE database and contains no genomic information.

GTF-FILE: A file with information about unknown RNA sequences. These have to be classified into lncRNAs and mRNAs.

Please let me know how to proceed. Frieder

zhaodoctor commented 7 years ago

The mRNA and GTF files can be downloaded from the Ensembl database (http://asia.ensembl.org/info/data/ftp/index.html); however, it contains no lincRNA information. Sorry, I have no idea how to find the exon information for lincRNAs from the NONCODE database.

frcsantos commented 6 years ago

I am having a similar problem while building training.dat. When I build the training.dat file, the result contains only +1 in the first column. I checked the other pre-processed files, and they contain both the +1 and -1 categories. I also checked my input files many times, and they seem fine.
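A quick way to confirm the label distribution described above is to count the first field of every line. This is just a sketch: `label_counts` is a hypothetical helper, and it assumes that training.dat is a whitespace-separated text file with the class label (+1 or -1) as the first field of each line.

```python
from collections import Counter


def label_counts(training_path):
    """Count how often each value occurs in the first column."""
    counts = Counter()
    with open(training_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                counts[fields[0]] += 1
    return counts
```

If `label_counts("training.dat")` has no "-1" key, none of the noncoding transcripts made it into the output.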