Open friederhadlich opened 7 years ago
The problem is that the path to the 'Matrix' file is wrong. Copy the folder 'dat' from '/media/disk3/lncScore/' to '/media/disk3/lncScore/tools/'; the file 'Matrix' will then be found and the problem should be solved.
I copied the 'dat' folder as explained but get this error message (as already mentioned above):
    python lncScore/tools/make_TrainingDat.py -m mrnas.fa -g merged-unknowns.gtf -l ../lincrnas.fa --hex cow-hexamer.tsv -p 1 -o training-out
    Traceback (most recent call last):
      File "lncScore/tools/make_TrainingDat.py", line 524, in <module>
        mainProcess(ARRAY,outPutFileName,1,coding,noncoding,Alphabet,Matrix_hash,mRNA_num)
      File "lncScore/tools/make_TrainingDat.py", line 352, in mainProcess
        Max_Mscore_exon = max(Exons_mscore)
    ValueError: max() arg is an empty sequence
Please let me know what to do ...
The error occurred during the calculation of the exon features, so I wonder whether some transcripts in mrnas.fa or lincrnas.fa have no corresponding exon information in merged-unknowns.gtf. You can check the 'inputfile.fasta' file: in each label line (beginning with '>') there should be some numbers (exon lengths) after the transcript ID.
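A quick way to run that check is sketched below. It is only an illustration, not part of lncScore: it assumes the header fields in 'inputfile.fasta' are separated by whitespace and that exon lengths appear as plain integers after the transcript ID (the exact delimiter is not stated in this thread, so adjust the split if needed). The function name is hypothetical. Any ID it reports would leave the exon score list empty, which is consistent with the max() error above.

```python
def headers_without_exon_lengths(fasta_lines):
    """Return IDs of FASTA records whose header has no integer field after the ID.

    Assumption: header looks like '>transcript_id 120 340 95' (whitespace-separated).
    """
    missing = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # bare '>' with no ID, nothing to report
        transcript_id, rest = fields[0], fields[1:]
        if not any(token.isdigit() for token in rest):
            missing.append(transcript_id)
    return missing


# Hypothetical records: tx1 carries exon lengths, tx2 does not.
example = [">tx1 120 340 95", "ATGC", ">tx2", "GGCC"]
print(headers_without_exon_lengths(example))
```

To run it on the real file, pass an open file handle instead of the example list, e.g. `headers_without_exon_lengths(open("inputfile.fasta"))`.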
Please re-download exon_extraction.pl from the 'cpmodule' folder. I have edited this script so that it now deletes those transcripts whose exon information cannot be found in the GTF file.
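To see in advance which transcripts would be dropped, the FASTA IDs can be compared against the GTF. The sketch below is not the logic of exon_extraction.pl; it only relies on the standard GTF layout (nine tab-separated columns, feature type in column 3, a `transcript_id "..."` attribute in column 9) and assumes the FASTA ID is the first word after '>' and matches the GTF transcript_id exactly. Function names are hypothetical.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcripts_with_exons(gtf_lines):
    """Collect transcript IDs that have at least one 'exon' record in the GTF."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # comment or header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(cols[8])
        if match:
            ids.add(match.group(1))
    return ids


def fasta_ids(fasta_lines):
    """Return the first word of every FASTA header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


# Hypothetical data: only tx1 has an exon record.
gtf = ['chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";']
fasta = [">tx1", "ATGC", ">tx2", "GGCC"]

exon_ids = gtf_transcripts_with_exons(gtf)
missing = [t for t in fasta_ids(fasta) if t not in exon_ids]
print(missing)
```

If the two files use different ID styles (for example a version suffix in one of them), the comparison would need a normalisation step first.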
Hi Zhao,
maybe I really am using incorrect input data.
MRNA-FILE: To generate the mRNA FASTA file, I filtered the GTF file from NCBI cow RNA for protein-coding entries. The output was converted to FASTA format using the cattle reference genome and gffread. In the resulting FASTA file, the exon length information is missing from the header line.
LNCRNA-FILE: Exon information is also missing in lincrnas.fa, because this file was downloaded directly from the NONCODE database and contains no genomic information.
GTF-FILE: File with information about unknown RNA sequences. These have to be classified into lncRNAs and mRNAs.
Please let me know how to proceed. Frieder
The mRNA and GTF files can be downloaded from the Ensembl database (http://asia.ensembl.org/info/data/ftp/index.html); however, it contains no lincRNA information. Sorry, I have no idea how to find the exon information for lincRNAs from the NONCODE database.
I am having a similar problem while building training.dat. When I build the file, the result has only +1 in the first column. I checked the other pre-processed files, and judging by them the file should contain both +1 and -1 categories. I also checked my input files many times, and they seem fine.
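For anyone wanting to confirm the same symptom, a minimal sketch for counting the labels is below. It assumes only that the label is the first whitespace-separated field on each line; the feature columns in the sample rows are made up for illustration and are not the real training.dat layout. The function name is hypothetical.

```python
from collections import Counter


def label_counts(lines):
    """Count how often each value appears in the first whitespace-separated column."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# Hypothetical rows; only the first column matters here.
sample = ["+1 1:0.5 2:0.1", "+1 1:0.3 2:0.9", "-1 1:0.2 2:0.4"]
counts = label_counts(sample)
print(counts)
```

On the real file, `label_counts(open("training.dat"))` showing no '-1' entry at all would suggest that every non-coding transcript was dropped or mislabeled before the features were written.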
Hi,
I tried to prepare the training dataset for lncSCORE.py using this command:
Rerunning the Python script from the parent folder generates this output:
Do you have any idea how to proceed?
Thanks in advance, Frieder