GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License

remove pyensembl requirement and solve Arabidopsis bug #57

Closed yuukiiwa closed 3 years ago

yuukiiwa commented 3 years ago

Here are the xpore-dataprep runs with Human and Arabidopsis references and annotations:

(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh
total 3557856
-rw-------@  1 yukkei  staff   215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r--@  1 yukkei  staff    94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r--@  1 yukkei  staff   1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@  1 yukkei  staff   366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
drwxr-xr-x   6 yukkei  staff   192B May 25 15:06 nanopolish
drwxr-xr-x  16 yukkei  staff   512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % xpore-dataprep \                            
--eventalign nanopolish/eventalign.txt \
--summary nanopolish/summary.txt \
--out_dir human_dataprep \
--genome --gtf_path_or_url Homo_sapiens.GRCh38.91.gtf --transcript_fasta_paths_or_urls Homo_sapiens.GRCh38.cdna.ncrna.fa --merge_transcript_id_version 
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:102: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:51: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh 
total 4203216
-rw-------@  1 yukkei  staff   215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r--@  1 yukkei  staff    94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r--@  1 yukkei  staff   1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@  1 yukkei  staff   366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
-rw-r--r--   1 yukkei  staff   315M May 25 15:29 Homo_sapiens.GRCh38.cdna.ncrna.fa.pickle
drwxr-xr-x   9 yukkei  staff   288B May 25 15:29 human_dataprep
drwxr-xr-x   6 yukkei  staff   192B May 25 15:06 nanopolish
drwxr-xr-x  16 yukkei  staff   512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh human_dataprep 
total 382408
-rw-r--r--  1 yukkei  staff   141B May 25 15:29 data.index
-rw-r--r--  1 yukkei  staff   953K May 25 15:29 data.json
-rw-r--r--  1 yukkei  staff   145B May 25 15:29 data.log
-rw-r--r--  1 yukkei  staff    98B May 25 15:29 data.readcount
-rw-r--r--  1 yukkei  staff   6.3K May 25 15:29 eventalign.index
-rw-r--r--  1 yukkei  staff   142M May 25 15:29 transcript_id_version_merged.gtf
-rw-r--r--  1 yukkei  staff    41M May 25 15:29 transcript_id_version_merged.gtf.pickle
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % xpore-dataprep \     
--eventalign nanopolish/arabidopsis_eventalign.txt \ 
--summary nanopolish/summary.txt \
--out_dir arabidopsis_dataprep \
--genome --gtf_path_or_url Arabidopsis_thaliana.TAIR10.50.gtf --transcript_fasta_paths_or_urls Arabidopsis_thaliana.TAIR10.cdna.all.fa 
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:102: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:51: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh
total 4394912
-rw-------@  1 yukkei  staff   215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r--   1 yukkei  staff    10M May 25 15:31 Arabidopsis_thaliana.TAIR10.50.gtf.pickle
-rw-r--r--@  1 yukkei  staff    94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r--   1 yukkei  staff    83M May 25 15:31 Arabidopsis_thaliana.TAIR10.cdna.all.fa.pickle
-rw-r--r--@  1 yukkei  staff   1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@  1 yukkei  staff   366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
-rw-r--r--   1 yukkei  staff   315M May 25 15:29 Homo_sapiens.GRCh38.cdna.ncrna.fa.pickle
drwxr-xr-x   7 yukkei  staff   224B May 25 15:31 arabidopsis_dataprep
drwxr-xr-x   9 yukkei  staff   288B May 25 15:29 human_dataprep
drwxr-xr-x   6 yukkei  staff   192B May 25 15:06 nanopolish
drwxr-xr-x  16 yukkei  staff   512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh arabidopsis_dataprep 
total 2000
-rw-r--r--  1 yukkei  staff   106B May 25 15:31 data.index
-rw-r--r--  1 yukkei  staff   979K May 25 15:31 data.json
-rw-r--r--  1 yukkei  staff   109B May 25 15:31 data.log
-rw-r--r--  1 yukkei  staff    62B May 25 15:31 data.readcount
-rw-r--r--  1 yukkei  staff   5.3K May 25 15:31 eventalign.index

I also compared the p-value rankings obtained with the xpore-dataprep from GoekeLab/xpore and the xpore-dataprep from yuukiiwa/xpore: with the demo test dataset, both versions of dataprep lead to the same xpore-diffmod results.
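
For anyone who wants to repeat this kind of check, here is a minimal sketch of a p-value ranking comparison between two diffmod tables. It is not the script used for this PR. It assumes each table is a CSV with `id`, `position` and `kmer` columns plus a p-value column; the column name `pval_KO_vs_WT` and the rows below are toy placeholders, not real results.

```python
import csv
import io


def load_ranking(handle, pval_col, key_cols=("id", "position", "kmer")):
    """Return site keys ordered by ascending p-value (ties broken by key)."""
    reader = csv.DictReader(handle)
    rows = [(float(r[pval_col]), tuple(r[k] for k in key_cols)) for r in reader]
    rows.sort()
    return [key for _, key in rows]


def same_ranking(handle_a, handle_b, pval_col):
    """True if both tables rank exactly the same sites in the same order."""
    return load_ranking(handle_a, pval_col) == load_ranking(handle_b, pval_col)


# Toy tables standing in for the two diffmod outputs (hypothetical values).
table_a = io.StringIO(
    "id,position,kmer,pval_KO_vs_WT\n"
    "T1,10,GGACT,0.01\n"
    "T2,20,AGACC,0.50\n"
)
table_b = io.StringIO(
    "id,position,kmer,pval_KO_vs_WT\n"
    "T2,20,AGACC,0.50\n"
    "T1,10,GGACT,0.01\n"
)
print(same_ranking(table_a, table_b, "pval_KO_vs_WT"))
```

With real outputs, the two `io.StringIO` objects would be replaced by open file handles on the two tables, and the p-value column name adjusted to match the condition names used in the run.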

I have also cross-checked the xpore-diffmod results with and without transcript versions on the demo test dataset; the results are the same in both cases.
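
A cross-check of this kind could be sketched as follows. This is an illustration only, not the code in this PR: `strip_version` and `normalise` are hypothetical helpers, and the p-values are toy numbers. It assumes Ensembl-style IDs where the version is a numeric suffix after the last dot. The same rule would also strip isoform suffixes from Arabidopsis IDs such as `AT1G01010.1`, so it should only be applied where the suffix really is a version.

```python
def strip_version(transcript_id):
    """Drop a trailing numeric '.N' suffix; leave other IDs unchanged."""
    base, sep, version = transcript_id.rpartition(".")
    return base if sep and version.isdigit() else transcript_id


def normalise(results):
    """Re-key {(transcript_id, position, kmer): pval} on version-less IDs."""
    return {
        (strip_version(tid), pos, kmer): pval
        for (tid, pos, kmer), pval in results.items()
    }


# Toy results (hypothetical values) from a run with and a run without versions.
with_versions = {("ENST00000000233.9", 100, "GGACT"): 0.01}
without_versions = {("ENST00000000233", 100, "GGACT"): 0.01}

print(normalise(with_versions) == normalise(without_versions))
```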