Open wososa opened 4 years ago
Hi Woody, Yes, you are right: please use a Kallisto index built from the protein-coding transcripts in Gencode v19. More Kallisto support will be added in the future so that RBP TPM generation is more standardized. Sorry for the inconvenience; the conversions are currently hard-coded. For now, please build the Kallisto index using the Gencode FASTA sequences here: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.pc_transcripts.fa.gz
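For anyone hitting a similar KeyError on a transcript ID, here is a minimal sketch (standalone Python 3, not part of Darts_DNN) that lists transcripts quantified by Kallisto but absent from the FASTA used for the index. It assumes the standard Kallisto abundance.tsv layout with a target_id column, and that the transcript ID is the first '|'-delimited field of each FASTA header, as in Gencode files. The function names are hypothetical, and comparing IDs without their version suffix is my own simplification.

```python
import csv
import gzip


def strip_version(transcript_id):
    """Reduce 'ENST00000335137.3|ENSG...' to 'ENST00000335137'."""
    return transcript_id.split("|")[0].split(".")[0]


def fasta_transcript_ids(fasta_path):
    """Collect version-stripped transcript IDs from FASTA header lines."""
    opener = gzip.open if fasta_path.endswith(".gz") else open
    ids = set()
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                ids.add(strip_version(line[1:].strip()))
    return ids


def abundance_target_ids(tsv_path):
    """Collect version-stripped IDs from the target_id column of abundance.tsv."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {strip_version(row["target_id"]) for row in reader}


def missing_targets(fasta_path, tsv_path):
    """Return quantified transcript IDs that do not appear in the FASTA."""
    return sorted(abundance_target_ids(tsv_path) - fasta_transcript_ids(fasta_path))
```

Calling missing_targets("gencode.v19.pc_transcripts.fa.gz", "Sample_WT_kallisto/abundance.tsv") and getting a long list back would suggest that the quantification used a different annotation than the one Darts_DNN expects.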
On Nov 16, 2019, at 10:22 AM, Woody Lin notifications@github.com wrote:
Hi Dr. Zhang,
I am trying the following command to run DARTS:
Darts_DNN build_feature -i bayes_infer/A5SS.darts_bht.flat.txt -c ~/.darts/DNN/v0.1.0/trainedParam/A5SS-trainedParam-EncodeRoadmap.h5 -e Sample_WT_kallisto Sample_KD_kallisto -o A5SS_data.h5 --t A5SS
I got the following error message:
2019-11-16 10:14:12,982 - Darts_DNN.build_feature - INFO - convert tx to gene TPM
Traceback (most recent call last):
...skip...
KeyError: 'ENST00000631435'
Does this mean that I am using the wrong files (or wrong version of gene annotation) from kallisto?
Files in the kallisto folder (based on Ensembl v96): abundance.h5 abundance.tsv run_info.json
Thanks, Woody
Hi Dr. Zhang,
Thanks for your quick reply. I proceeded with Gencode v19, installed the Python module "tables", and got the following error:
`
/Darts/RBP_tpm.txt
.. read sequence feature
Traceback (most recent call last):
File "/anaconda3/envs/darts/bin/Darts_DNN", line 4, in
File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/EGG-INFO/scripts/Darts_DNN", line 49, in main
File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/Darts_build_feature.py", line 157, in parser
File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/Darts_build_feature.py", line 98, in make_single_table
File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/utils.py", line 326, in read_sequence_feature
File "/anaconda3/envs/darts/lib/python2.7/site-packages/pandas/io/pytables.py", line 377, in read_hdf
raise ValueError('No dataset in HDF5 file.')
ValueError: No dataset in HDF5 file.
`
RBP_tpm.txt has been generated successfully, but the h5 file wasn't generated. Could you elaborate more on this error?
Thanks, Woody
@wososa Please use predict directly without build_feature. build_feature is a legacy sub-command that uses more disk space and will be removed in the future. Please follow the usage example here, in case it helps pinpoint further issues:
https://darts-dnn.readthedocs.io/en/latest/#using-predict
I have updated the README.md to avoid future confusion.
@zj-zhang Thanks for your reply. build_feature is needed to produce RBP_tpm.txt, right? It seems that I need the RBP_tpm.txt file to run the predict function.
@wososa Not necessarily, actually. For example, you can run predict directly like so:
Darts_DNN predict -i darts_flat/Sp_out.txt \
-o darts_pred.txt \
-e kallisto/Day5_rep1/,kallisto/Day5_rep2/,kallisto/Day5_rep3/ kallisto/No_Dox_rep1/,kallisto/No_Dox_rep2/,kallisto/No_Dox_rep3/
This is described in the help message shown by running Darts_DNN predict with the -h option:
$ Darts_DNN predict -h
usage: Darts_DNN predict [-h] -i INPUT -o OUTPUT [-t {SE,A5SS,A3SS,RI}]
[-e EXPR [EXPR ...]] [-m MODEL]
optional arguments:
-h, --help show this help message and exit
-i INPUT Input feature file (*.h5) or Darts_BHT output (*.txt)
-o OUTPUT Output filename
-t {SE,A5SS,A3SS,RI} Optional, default SE: specify the alternative splicing
event type. SE: skipped exons, A3SS: alternative 3
splice sites, A5SS: alternative 5 splice sites, RI:
retained introns
-e EXPR [EXPR ...] Optional, required if input is Darts_BHT output;
Folder path for Kallisto expression files; e.g '-e
Ctrl_rep1,Ctrl_rep2 KD_rep1,KD_rep2'
-m MODEL Optional, default using current version model in user
home directory: Filepath for a specific model
parameter file
Hope this helps.
In fact, in case it is useful for others, let me add that using predict directly is currently the recommended way to use Darts_DNN :) Thanks again @wososa
I can understand now. Thanks!
@zj-zhang My A5SS.darts_bht.flat.txt has 5,084 records, but the A5SS_pred.txt file only has 36 records. Any idea why so many of the records are lost during the Darts_DNN predict step?
@wososa Most likely it's because most of the A5SS events in your file do not have pre-compiled cis-sequence features. Could you check the ID overlap between A5SS.darts_bht.flat.txt and $HOME/.darts/DNN/v0.1.0/cisFeature/A5SS.norm.txt.gz?
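The overlap check suggested above could be sketched as follows in standalone Python 3. This is only an illustration under assumptions that are not confirmed in this thread: both files are taken to be tab-delimited, with one header row and the event ID in the first column. The function names are hypothetical.

```python
import gzip


def first_column_ids(path, skip_header=True):
    """Read the first tab-delimited field of every line (assumed to be the event ID)."""
    opener = gzip.open if path.endswith(".gz") else open
    ids = set()
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if skip_header and line_number == 0:
                continue  # assumption: the first line is a column header
            first_field = line.rstrip("\n").split("\t")[0]
            if first_field:
                ids.add(first_field)
    return ids


def id_overlap(flat_path, feature_path):
    """Return (events in flat file, events in feature file, events in both)."""
    flat_ids = first_column_ids(flat_path)
    feature_ids = first_column_ids(feature_path)
    return len(flat_ids), len(feature_ids), len(flat_ids & feature_ids)
```

For example, id_overlap("A5SS.darts_bht.flat.txt", os.path.expanduser("~/.darts/DNN/v0.1.0/cisFeature/A5SS.norm.txt.gz")) (with os imported) returns three counts. If the third count is close to the number of records in the prediction output, missing cis-sequence features would be the likely reason for the dropped events.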
@zj-zhang Thanks for your quick reply. If the number of overlapping events is small, does it mean that my A5SS events are new to the Gencode annotation? I probably can't process the large number of RNA-seq datasets in DARTS-DNN to re-generate the features.
Yes, if the number of overlapping events is small, the A5SS events are likely novel events specific to your RNA-seq data. The sequence features were compiled by @zcpan; if that's indeed the case, I will open a new issue for it so we can keep better track.
I am not sure what went wrong, but it appears that Darts_DNN is not recognizing the input directory supplied with the -e parameter. I ran Darts_DNN the way that was suggested:
Darts_DNN predict -i A5SS.darts_bht.flat.converted_hg19.txt -e /Genotypes/tmp/DARTS_RNA/RealRun/CHR17Run/kallisto/output_KU/ -o predA5SS.txt -t A5SS
constructing in-memory feature matrix
Traceback (most recent call last):
File "/anaconda3/envs/darts/bin/Darts_DNN", line 4, in
File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/EGG-INFO/scripts/Darts_DNN", line 44, in main
File "/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/Darts_pred.py", line 103, in parser
File "/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/utils.py", line 285, in construct_training_data_from_label
Exception: this file is not found: /Genotypes/tmp/DARTS_RNA/RealRun/CHR17Run/kallisto_Fasta/output_KU
Any suggestions on how to sort this out? It would be really helpful if a standardized liftover (hg38 -> hg19) and standardized pred file generation could be added to the manual. Kallisto appears to have run without any errors and produced all 3 output files (abundance.h5, abundance.tsv, run_info.json), so why was this input not suitable? Could this error be originating from the A5SS.darts_bht.flat.converted_hg19.txt file by any chance?
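One way to narrow down a "file is not found" error like this is to check the -e arguments before calling Darts_DNN. The sketch below is standalone Python 3 and not part of Darts_DNN. It splits each -e argument on commas, following the 'Ctrl_rep1,Ctrl_rep2 KD_rep1,KD_rep2' example in the help message, and reports directories that do not exist or that lack one of the three Kallisto output files named in this thread. Which of those files Darts_DNN actually reads is not stated here, so the EXPECTED_FILES list is an assumption, and the function names are hypothetical.

```python
import os

# Assumption: these are the three Kallisto outputs mentioned in this thread;
# the thread does not say which of them Darts_DNN actually opens.
EXPECTED_FILES = ("abundance.h5", "abundance.tsv", "run_info.json")


def parse_expr_groups(expr_args):
    """Split each -e argument (one per condition) into its replicate directories."""
    return [[d for d in arg.split(",") if d] for arg in expr_args]


def check_kallisto_dirs(expr_args):
    """Return a list of (directory, problem) tuples; an empty list means no problems found."""
    problems = []
    for group in parse_expr_groups(expr_args):
        for directory in group:
            if not os.path.isdir(directory):
                problems.append((directory, "directory not found"))
                continue
            for name in EXPECTED_FILES:
                if not os.path.isfile(os.path.join(directory, name)):
                    problems.append((directory, "missing " + name))
    return problems
```

Running check_kallisto_dirs on the exact strings passed to -e makes it easier to see whether the path in the error message matches the path that was supplied. Note also that the help example passes two groups to -e, one per condition; whether a single group is accepted is not stated in this thread.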