Here are the xpore-dataprep runs with Human and Arabidopsis references and annotations:
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh
total 3557856
-rw-------@ 1 yukkei staff 215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r--@ 1 yukkei staff 94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r--@ 1 yukkei staff 1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@ 1 yukkei staff 366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
drwxr-xr-x 6 yukkei staff 192B May 25 15:06 nanopolish
drwxr-xr-x 16 yukkei staff 512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % xpore-dataprep \
--eventalign nanopolish/eventalign.txt \
--summary nanopolish/summary.txt \
--out_dir human_dataprep \
--genome --gtf_path_or_url Homo_sapiens.GRCh38.91.gtf --transcript_fasta_paths_or_urls Homo_sapiens.GRCh38.cdna.ncrna.fa --merge_transcript_id_version
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:102: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:51: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh
total 4203216
-rw-------@ 1 yukkei staff 215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r--@ 1 yukkei staff 94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r--@ 1 yukkei staff 1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@ 1 yukkei staff 366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
-rw-r--r-- 1 yukkei staff 315M May 25 15:29 Homo_sapiens.GRCh38.cdna.ncrna.fa.pickle
drwxr-xr-x 9 yukkei staff 288B May 25 15:29 human_dataprep
drwxr-xr-x 6 yukkei staff 192B May 25 15:06 nanopolish
drwxr-xr-x 16 yukkei staff 512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh human_dataprep
total 382408
-rw-r--r-- 1 yukkei staff 141B May 25 15:29 data.index
-rw-r--r-- 1 yukkei staff 953K May 25 15:29 data.json
-rw-r--r-- 1 yukkei staff 145B May 25 15:29 data.log
-rw-r--r-- 1 yukkei staff 98B May 25 15:29 data.readcount
-rw-r--r-- 1 yukkei staff 6.3K May 25 15:29 eventalign.index
-rw-r--r-- 1 yukkei staff 142M May 25 15:29 transcript_id_version_merged.gtf
-rw-r--r-- 1 yukkei staff 41M May 25 15:29 transcript_id_version_merged.gtf.pickle
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % xpore-dataprep \
--eventalign nanopolish/arabidopsis_eventalign.txt \
--summary nanopolish/summary.txt \
--out_dir arabidopsis_dataprep \
--genome --gtf_path_or_url Arabidopsis_thaliana.TAIR10.50.gtf --transcript_fasta_paths_or_urls Arabidopsis_thaliana.TAIR10.cdna.all.fa
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:102: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
/Users/yukkei/opt/anaconda3/lib/python3.8/site-packages/xpore-1.0-py3.8.egg/xpore/scripts/dataprep.py:51: PerformanceWarning: indexing past lexsort depth may impact performance.
pos_end += eventalign_result.loc[index]['line_length'].sum()
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh
total 4394912
-rw-------@ 1 yukkei staff 215M May 20 16:31 Arabidopsis_thaliana.TAIR10.50.gtf
-rw-r--r-- 1 yukkei staff 10M May 25 15:31 Arabidopsis_thaliana.TAIR10.50.gtf.pickle
-rw-r--r--@ 1 yukkei staff 94M May 20 16:37 Arabidopsis_thaliana.TAIR10.cdna.all.fa
-rw-r--r-- 1 yukkei staff 83M May 25 15:31 Arabidopsis_thaliana.TAIR10.cdna.all.fa.pickle
-rw-r--r--@ 1 yukkei staff 1.0G May 25 15:23 Homo_sapiens.GRCh38.91.gtf
-rw-r--r--@ 1 yukkei staff 366M May 25 15:24 Homo_sapiens.GRCh38.cdna.ncrna.fa
-rw-r--r-- 1 yukkei staff 315M May 25 15:29 Homo_sapiens.GRCh38.cdna.ncrna.fa.pickle
drwxr-xr-x 7 yukkei staff 224B May 25 15:31 arabidopsis_dataprep
drwxr-xr-x 9 yukkei staff 288B May 25 15:29 human_dataprep
drwxr-xr-x 6 yukkei staff 192B May 25 15:06 nanopolish
drwxr-xr-x 16 yukkei staff 512B May 25 15:07 xpore
(base) yukkei@yukkeis-Mac-mini solve_bug_arabidopsis % ls -lh arabidopsis_dataprep
total 2000
-rw-r--r-- 1 yukkei staff 106B May 25 15:31 data.index
-rw-r--r-- 1 yukkei staff 979K May 25 15:31 data.json
-rw-r--r-- 1 yukkei staff 109B May 25 15:31 data.log
-rw-r--r-- 1 yukkei staff 62B May 25 15:31 data.readcount
-rw-r--r-- 1 yukkei staff 5.3K May 25 15:31 eventalign.index
I also compared the p-value rankings obtained with the xpore-dataprep on GoekeLab/xpore and the xpore-dataprep on yuukiiwa/xpore: the two versions of dataprep led to the same xpore-diffmod results on the demo test dataset.
I have also cross-checked the xpore-diffmod results with and without transcript versions on the demo test dataset; both give the same results.
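
For anyone who wants to repeat the p-value ranking comparison, here is a minimal sketch of how it could be done with pandas. It is not the exact script I ran. It assumes each xpore-diffmod table has been loaded into a DataFrame with `id`, `position` and `kmer` columns plus one p-value column; the helper names `rank_by_pvalue` and `same_ranking` are made up for this illustration, and the p-value column name depends on the condition names in your own config.

```python
import pandas as pd

# Columns assumed to identify a site in the diffmod table (an assumption of this sketch).
SITE_KEYS = ["id", "position", "kmer"]


def rank_by_pvalue(table: pd.DataFrame, pval_col: str) -> pd.Series:
    """Rank sites by p-value (1 = smallest p-value), indexed by site keys."""
    keyed = table.set_index(SITE_KEYS)
    # method="min" gives tied p-values the same (lowest) rank.
    return keyed[pval_col].rank(method="min")


def same_ranking(a: pd.DataFrame, b: pd.DataFrame, pval_col: str) -> bool:
    """Return True if both tables contain the same sites with the same p-value ranks."""
    ranks_a = rank_by_pvalue(a, pval_col).sort_index()
    ranks_b = rank_by_pvalue(b, pval_col).sort_index()
    # Series.equals compares both the index (the sites) and the rank values.
    return ranks_a.equals(ranks_b)
```

With two tables read via `pd.read_csv`, a call would look like `same_ranking(table_goekelab, table_yuukiiwa, "pval_KO_vs_WT")`, where the variable names and the column name `pval_KO_vs_WT` are placeholders. Row order does not matter, because both rankings are sorted by site before they are compared.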