GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License
132 stars 22 forks source link

Genome mapping for Arabidopsis in Xpore #55

Closed Annie-GW closed 9 months ago

Annie-GW commented 3 years ago

Dear Ploy,

We are stuck at the genome mapping part in Xpore. I know this is a big favor to ask but would it be possible to help us to look at the script and the error message? Any advice would be greatly appreciated. Thank you so much.

xpore-dataprep \ --eventalign ../R00122/nanopolish/R00122.nanopolish.xpore.tsv \ --summary ../R00122/nanopolish/R00122.nanopolish.xpore.summary.txt \ --out_dir ../R00122/dataprep \ --customised_genome \ --reference_name Arabidopsis_thaliana.TAIR10.dna.toplevel \ --annotation_name Arabidopsis_thaliana.TAIR10.49.gtf \ --gtf_path_or_url ../ref/Arabidopsis_thaliana.TAIR10.49.gtf \ --n_processes 32 \ --genome

The error message:

/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

chrishendra93 commented 3 years ago

Hi Annie,

I think that is only a warning thrown out by pandas and not an error message, the program should still be running. Did you see an increase in the number of lines / file size from data.json?

Annie-GW commented 3 years ago

Hi Chris,

Thank you but the final data.json file is actually empty. We are not sure what is going on. Do you spot any mistakes in the command?

Kind regards, Annie

chrishendra93 commented 3 years ago

so during the daraprep xpore will index your eventalign.txt first before it writes something to data.json. Can you check the eventalign.index file? Try wc -l eventalign.index to see if it has anything in there and maybe do it a couple of times to see if the number of lines increases

Annie-GW commented 3 years ago

Hi Chris, We have re-run the analysis following the command posted previously. We can obtain the eventalign.index file (51.7Mb) and this is the message that we get :

xx/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) Traceback (most recent call last): File "xx/envs/xpore/bin/xpore-dataprep", line 8, in sys.exit(main()) File "xx/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py", line 653, in main reference_name=reference_name, UnboundLocalError: local variable 'reference_name' referenced before assignment

chrishendra93 commented 3 years ago

Hi Annie, thanks for running it again

It seems that the indexing works successfully, and the error seems to indicate that you did not provide the fasta file using the option --transcript_fasta_paths_or_urls

Also in the next run, since you have generated the index file, you can supply the --skip_eventalign_indexing argument as well to skip the indexing

Annie-GW commented 3 years ago

Thanks, Chris. We have re-run the analysis according to your suggestion. The resulting data.json file is only 235 kb containing 40+ genes! When we did the mapping to transcriptome, the file size for data.json is 3.4 GB. Here is the log for the run:

xpore-dataprep \ --eventalign xx/R00122/nanopolish/R00122.nanopolish.xpore.tsv \ --summary xx/R00122/nanopolish/R00122.nanopolish.xpore.summary.txt \ --out_dir xx/R00122/dataprep \ --customised_genome \ --reference_name Arabidopsis_thaliana.TAIR10.dna.toplevel \ --annotation_name Arabidopsis_thaliana.TAIR10.49.gtf \ --gtf_path_or_url /ref/Arabidopsis_thaliana.TAIR10.49.gtf \ --transcript_fasta_paths_or_urls /ref-transcript-fa/reference_transcriptome.fa \ --n_processes 32 \ --genome

running customised genome /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) INFO:pyensembl.sequence_data:Loaded sequence dictionary from /ref-transcript-fa/reference_transcriptome.fa.pickle

yuukiiwa commented 3 years ago

Hi @Annie-GW, thanks for showing the log of the mapping to genome (using gene id and genome coordinates)! To map to transcriptome (using transcript id and transcriptome coordinates) you don't need the --genome parameter and the --customised_genome and related fasta and gtf parameters:

xpore-dataprep \
--eventalign ../R00122/nanopolish/R00122.nanopolish.xpore.tsv \
--summary ../R00122/nanopolish/R00122.nanopolish.xpore.summary.txt \ 
--out_dir ../R00122/dataprep \
--n_processes 32

Thanks!

Annie-GW commented 3 years ago

Thanks, Yuk Kei. We have no problem mapping to transcriptome. It is the mapping to genome part that we are stuck.

yuukiiwa commented 3 years ago

Hi Annie, which reference fasta and annotation gtf are you using? I'm not sure whether non-ensembl references and annotations could cause the problem. There are several things helpful to check upstream of xpore-dataprep for the mapping to genome:

6ge6 commented 3 years ago

hi yuukiiwa. thanks for your suggestion. we tried it again followed your 3 steps. we find when the index finished, the program just output an empty data.json file and exit.

log shows below.

running customised genome
INFO:pyensembl.sequence_data:Loaded sequence dictionary from Arabidopsis_thaliana.TAIR10.cdna.all.fa.pickle

command shows below

xpore-dataprep \
            --eventalign nanopolish/R001.nanopolish.xpore.tsv \
            --summary nanopolish/R001.nanopolish.xpore.summary.txt \
            --out_dir dataprep \
            --skip_eventalign_indexing \
            --n_processes 32 \
            --customised_genome \
            --reference_name Arabidopsis_thaliana.TAIR10.cdna.all.fa \
            --annotation_name Arabidopsis_thaliana.TAIR10.50.gtf \
            --gtf_path_or_url ref/Arabidopsis_thaliana.TAIR10.50.gtf \
            --transcript_fasta_paths_or_urls ref/Arabidopsis_thaliana.TAIR10.cdna.all.fa \
            --genome

looking forward to get your response. thanks

yuukiiwa commented 3 years ago

Hi @6ge6, sorry for the delayed reply! I think I know where the problem is now:

Here is a line from the Arabidopsis gtf file (transcript_id "AT1G01010.1"):

1   araport11   transcript  3631    5899    .   +   .   gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

and here is a line from the Arabidopsis cDNA fasta file (transcript ID includes version):

>AT1G19090.1 cdna chromosome:TAIR10:1:6589835:6592762:1 gene:AT1G19090 gene_biotype:nontranslating_CDS transcript_biotype:nontranslating_CDS gene_symbol:RKF2 description:receptor-like serine/threonine kinase 2 [Source:TAIR;Acc:AT1G19090]

Here is a line from the Human gtf file (transcript_id "ENST00000456328"; transcript_version "2"):

1   havana  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";

and here is a line from the Human cDNA fasta file (transcript ID includes version):

>ENST00000434970.2 cdna chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T-cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]

As xPore was developed using human samples, to facilitate the transcriptome to genome mapping, we removed the transcript version, which does not make a difference in processing human samples. However, in your case, our removing the transcript version causes it not being to match with the gtf transcript_id entry, which contains the transcript version. We will definitely have to change this, and hopefully we can give you an update a week from now.

Thank you so much for using xPore on Arabidopsis! Or not we wouldn't have known that its incompatibility with the Arabidopsis reference/annotation.

6ge6 commented 3 years ago

Hi @yuukiiwa , Great, thanks for your help. looking forward your new version. Regards,

yuukiiwa commented 3 years ago

Hi @6ge6, thank you for waiting! I have updated the xpore implementation on my fork, which is now compatible with Arabidopsis and Human Ensembl references. It is now pull requesting into the GoekeLab/xpore, but will take a while to get the changes reviewed. I have tested the new implementation on my branch, and it produces the same results on the demo test-dataset. @chrishendra93 and I think that it should be fine if you use the xpore on my fork for now as it will eventually be merged into GoekeLab/xpore. To install the xpore on my fork, first you have to clone it:

git clone https://github.com/yuukiiwa/xpore.git

Then, install it:

sudo python3 setup.py install

To run, xpore-dataprep using Arabidopsis references:

xpore-dataprep \
--eventalign nanopolish/arabidopsis_eventalign.txt \
--summary nanopolish/summary.txt \
--out_dir arabidopsis_dataprep \
--genome --gtf_path_or_url Arabidopsis_thaliana.TAIR10.50.gtf --transcript_fasta_paths_or_urls Arabidopsis_thaliana.TAIR10.cdna.all.fa

I have tried making a mock demo eventalign.txt using Arabidopsis transcript ids to test whether using Arabidopsis fasta and gtf works, and here is the dataprep directory:

ls -lh arabidopsis_dataprep 
total 2024
-rw-r--r--  1 yukkei  staff   104B May 27 11:33 data.index
-rw-r--r--  1 yukkei  staff   979K May 27 11:33 data.json
-rw-r--r--  1 yukkei  staff   109B May 27 11:33 data.log
-rw-r--r--  1 yukkei  staff    62B May 27 11:33 data.readcount
-rw-r--r--  1 yukkei  staff   5.3K May 27 11:33 eventalign.index

Thank you! Hope this helps!

YCCHEN23 commented 2 years ago

Hi, I encounter the same problem while performing version 2.1.

So, only the version of @yuukiiwa 's fork can deal with the Arabidopsis data?

Thanks!