Xinglab / DARTS

Deep-learning Augmented RNA-seq analysis of Transcript Splicing

About kallisto files #14

Open wososa opened 4 years ago

wososa commented 4 years ago

Hi Dr. Zhang,

I am trying the following command to run DARTS:

Darts_DNN build_feature -i bayes_infer/A5SS.darts_bht.flat.txt -c ~/.darts/DNN/v0.1.0/trainedParam/A5SS-trainedParam-EncodeRoadmap.h5 -e Sample_WT_kallisto Sample_KD_kallisto -o A5SS_data.h5 --t A5SS

I got the following error message:

2019-11-16 10:14:12,982 - Darts_DNN.build_feature - INFO - convert tx to gene TPM
Traceback (most recent call last):
...skip...
KeyError: 'ENST00000631435'

Does this mean that I am using the wrong files (or wrong version of gene annotation) from kallisto?

Files in the kallisto folder (based on Ensembl v96): abundance.h5, abundance.tsv, run_info.json

Thanks, Woody

zj-zhang commented 4 years ago

Hi Woody, Yes, you are right: please use a Kallisto index built from the protein-coding transcripts in Gencode v19. More Kallisto support will be added in the future so that RBP TPM generation is more standardized. Sorry for the inconvenience; the conversions are currently hard-coded. For now, please build the Kallisto index from the Gencode FASTA sequences here: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.pc_transcripts.fa.gz
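For anyone who wants to confirm which annotation a kallisto run was built on before re-running DARTS, the following is a minimal sketch, not part of DARTS. It assumes that kallisto writes the FASTA header (up to the first whitespace) as `target_id` in abundance.tsv, and the function names are hypothetical. It is written for Python 3, so it should be run outside the Python 2.7 darts environment.

```python
import csv
import gzip


def read_fasta_ids(fasta_gz_path):
    """Return the set of sequence IDs (header text up to the first whitespace)."""
    ids = set()
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def read_kallisto_ids(abundance_tsv_path):
    """Return the set of target_id values from a kallisto abundance.tsv."""
    with open(abundance_tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["target_id"] for row in reader}


def unexpected_ids(abundance_tsv_path, fasta_gz_path):
    """IDs quantified by kallisto that are absent from the reference FASTA."""
    return read_kallisto_ids(abundance_tsv_path) - read_fasta_ids(fasta_gz_path)
```

If `unexpected_ids("Sample_WT_kallisto/abundance.tsv", "gencode.v19.pc_transcripts.fa.gz")` returns a non-empty set, the index was probably built from a different annotation than the one the hard-coded conversion expects.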

wososa commented 4 years ago

Hi Dr. Zhang,

Thanks for your quick reply. I proceeded with Gencode v19, installed the Python module "tables", and got the following error:

/Darts/RBP_tpm.txt .. read sequence feature
Traceback (most recent call last):
  File "/anaconda3/envs/darts/bin/Darts_DNN", line 4, in <module>
    __import__('pkg_resources').run_script('Darts-DNN==0.1.0', 'Darts_DNN')
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1460, in run_script
    exec(script_code, namespace, namespace)
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/EGG-INFO/scripts/Darts_DNN", line 192, in <module>
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/EGG-INFO/scripts/Darts_DNN", line 49, in main
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/Darts_build_feature.py", line 157, in parser
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/Darts_build_feature.py", line 98, in make_single_table
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/utils.py", line 326, in read_sequence_feature
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/pandas/io/pytables.py", line 377, in read_hdf
    raise ValueError('No dataset in HDF5 file.')
ValueError: No dataset in HDF5 file.

RBP_tpm.txt was generated successfully, but the HDF5 (.h5) file was not. Could you elaborate on this error?

Thanks, Woody

zj-zhang commented 4 years ago

@wososa Please use predict directly, without build_feature. build_feature is a legacy sub-command that uses more disk space and will be removed in the future. Please follow the usage example here, in case it helps to pinpoint further issues: https://darts-dnn.readthedocs.io/en/latest/#using-predict

I have updated the README.md to avoid future confusion.

wososa commented 4 years ago

@zj-zhang Thanks for your reply. build_feature is needed to produce RBP_tpm.txt, right? It seems that I need the RBP_tpm.txt file to run the predict function.

zj-zhang commented 4 years ago

@wososa Not necessarily, actually. For example, you can run predict directly like so:

Darts_DNN predict -i darts_flat/Sp_out.txt \
-o darts_pred.txt \
-e kallisto/Day5_rep1/,kallisto/Day5_rep2/,kallisto/Day5_rep3/ kallisto/No_Dox_rep1/,kallisto/No_Dox_rep2/,kallisto/No_Dox_rep3/

This is shown in the help message, which you can display by running Darts_DNN predict with the -h option:

$ Darts_DNN predict -h
usage: Darts_DNN predict [-h] -i INPUT -o OUTPUT [-t {SE,A5SS,A3SS,RI}]
                         [-e EXPR [EXPR ...]] [-m MODEL]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT              Input feature file (*.h5) or Darts_BHT output (*.txt)
  -o OUTPUT             Output filename
  -t {SE,A5SS,A3SS,RI}  Optional, default SE: specify the alternative splicing
                        event type. SE: skipped exons, A3SS: alternative 3
                        splice sites, A5SS: alternative 5 splice sites, RI:
                        retained introns
  -e EXPR [EXPR ...]    Optional, required if input is Darts_BHT output;
                        Folder path for Kallisto expression files; e.g '-e
                        Ctrl_rep1,Ctrl_rep2 KD_rep1,KD_rep2'
  -m MODEL              Optional, default using current version model in user
                        home directory: Filepath for a specific model
                        parameter file

Hope this helps.
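As a rough pre-flight check for the -e argument described above, here is a minimal sketch. It assumes the format shown in the help message (one space-separated argument per condition, with replicate folders joined by commas) and checks for the three kallisto output files named earlier in this thread. Which of those files Darts_DNN actually reads is not stated here, and the function names are hypothetical.

```python
import os

# File names taken from the kallisto folder listing earlier in this thread.
KALLISTO_FILES = ("abundance.h5", "abundance.tsv", "run_info.json")


def split_expr_groups(expr_args):
    """Split each -e group (comma-separated folders) into a list of paths."""
    return [[p for p in group.split(",") if p] for group in expr_args]


def missing_kallisto_files(expr_args):
    """Map each folder to the kallisto output files it lacks (empty if complete)."""
    report = {}
    for group in split_expr_groups(expr_args):
        for folder in group:
            report[folder] = [
                name for name in KALLISTO_FILES
                if not os.path.isfile(os.path.join(folder, name))
            ]
    return report


if __name__ == "__main__":
    # Same two groups as in the predict example above.
    groups = [
        "kallisto/Day5_rep1/,kallisto/Day5_rep2/,kallisto/Day5_rep3/",
        "kallisto/No_Dox_rep1/,kallisto/No_Dox_rep2/,kallisto/No_Dox_rep3/",
    ]
    for folder, missing in missing_kallisto_files(groups).items():
        print(folder, "OK" if not missing else "missing: " + ", ".join(missing))
```

A folder that reports missing files, or that does not exist at all, is worth fixing before running predict.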

zj-zhang commented 4 years ago

In fact, in case it is useful for others, let me add that using predict directly is currently the recommended way to use Darts_DNN :) Thanks again @wososa

wososa commented 4 years ago

I understand now. Thanks!

wososa commented 4 years ago

@zj-zhang My A5SS.darts_bht.flat.txt has 5,084 records, but the A5SS_pred.txt file only has 36 records. Any idea why many of the records are lost during the Darts_DNN predict step?

zj-zhang commented 4 years ago

@wososa Most likely it's because most of the A5SS events in your file do not have pre-compiled cis-sequence features. Could you check the ID overlap between A5SS.darts_bht.flat.txt and $HOME/.darts/DNN/v0.1.0/cisFeature/A5SS.norm.txt.gz?
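One way to run that overlap check is sketched below. It is only a sketch under assumptions that are not confirmed in this thread: both files are taken to be tab-delimited with a header row and with the event ID in the first column. The function names are hypothetical, and the code targets Python 3.

```python
import gzip


def read_first_column(path, skip_header=True):
    """Collect the first tab-separated field of every non-empty line.

    Assumes (not confirmed here) that the event ID is the first column.
    """
    opener = gzip.open if path.endswith(".gz") else open
    ids = set()
    with opener(path, "rt") as handle:
        if skip_header:
            next(handle, None)
        for line in handle:
            line = line.rstrip("\r\n")
            if line:
                ids.add(line.split("\t")[0])
    return ids


def id_overlap(bht_path, cis_feature_path):
    """Return (shared, total): BHT events with cis features, and all BHT events."""
    bht_ids = read_first_column(bht_path)
    cis_ids = read_first_column(cis_feature_path)
    return len(bht_ids & cis_ids), len(bht_ids)
```

Calling `id_overlap("A5SS.darts_bht.flat.txt", "A5SS.norm.txt.gz")` with the full path to the cisFeature file gives the number of events that can be scored next to the number of events in the input. If the IDs sit in a different column, the `split("\t")[0]` index needs adjusting.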

wososa commented 4 years ago

@zj-zhang Thanks for your quick reply. If the number of overlapping events is small, does it mean that my A5SS events are novel relative to the Gencode annotation? I probably can't process the large number of RNA-seq datasets used in DARTS-DNN to regenerate the features.

zj-zhang commented 4 years ago

Yes, if the number of overlapping events is small, the A5SS events are likely novel events specific to your RNA-seq data. The sequence features were compiled by @zcpan; if that's indeed the case, I will open a new issue for it so we can keep better track.

astulaaa commented 2 years ago

I am not sure what went wrong, but it appears that Darts_DNN is not recognizing the input directory supplied with the -e parameter. I ran Darts_DNN the way that was suggested:

Darts_DNN predict -i A5SS.darts_bht.flat.converted_hg19.txt -e /Genotypes/tmp/DARTS_RNA/RealRun/CHR17Run/kallisto/output_KU/ -o predA5SS.txt -t A5SS

constructing in-memory feature matrix
Traceback (most recent call last):
  File "/anaconda3/envs/darts/bin/Darts_DNN", line 4, in <module>
    __import__('pkg_resources').run_script('Darts-DNN==0.1.0', 'Darts_DNN')
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1469, in run_script
    exec(script_code, namespace, namespace)
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/EGG-INFO/scripts/Darts_DNN", line 192, in <module>
  File "/anaconda3/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/EGG-INFO/scripts/Darts_DNN", line 44, in main
  File "/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/Darts_pred.py", line 103, in parser
  File "/envs/darts/lib/python2.7/site-packages/Darts_DNN-0.1.0-py2.7.egg/Darts_DNN/utils.py", line 285, in construct_training_data_from_label
Exception: this file is not found: /Genotypes/tmp/DARTS_RNA/RealRun/CHR17Run/kallisto_Fasta/output_KU

Any suggestions on how to sort this out? It would be really helpful if a standardized liftover (hg38 -> hg19) and standardized pred file generation could be added to the manual. kallisto appears to have run without any errors and produced all three output files (abundance.h5, abundance.tsv, run_info.json), so why is this input not suitable? Could this error originate from the A5SS.darts_bht.flat.converted_hg19.txt file by any chance?