GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License
134 stars 22 forks source link

Position to coordinate #144

Closed ynren1020 closed 11 months ago

ynren1020 commented 2 years ago

Hi there,

I have a quick question about how to find the coordinate for the modification site. The output from xpore is the position relative to the start position of a transcript (I used gencode transcriptome fasta file). How could I convert the position to coordinate? Right now, I have to do it manually, blat the sequence and find it. Is there a way we can do it in a programming way?

Thank you so much for your help!

Yanan

yuukiiwa commented 2 years ago

Hi Yanan(@ynren1020),

If I understand your question correctly, you want to convert the transcript position to the genome coordinate.

You can use the --genome flag with gtf and fasta provided through the --gtf_or_gff and --transcript_fasta, respectively, to output the genome coordinate of a site. Here is an example command from xpore's documentation:

xpore dataprep \
--eventalign nanopolish/eventalign.txt \
--gtf_or_gff ../../demo.gtf \
--transcript_fasta ../../demo.fa \
--out_dir dataprep \
--genome

Thanks!

Best wishes, Yuk Kei

ynren1020 commented 2 years ago

Hi Yuk Kei,

Thanks a lot for your reply. I am trying your suggestions on one of my samples. However, I remembered that when I used all the arguments a few weeks ago, the output data.json was empty, I used gencode.v38.transcripts.fa and gencode.v38.annotation.gtf for xpore --dataprep command. I removed --genome and --gtf or gff command, the data.json output can be successfully produced. I do not know why this happened.

And also for the xpore --dataprep step, what are the requirements of arguments? For the gtf and fasta files, do you have any recommendations on how to select these files? If users only used transcripts.fasta file, is there an optional script that can convert modification site position to genome coordinate?

Thank you so much for all your support!

Yanan

On Tue, Apr 26, 2022 at 2:35 AM Yuk Kei Wan @.***> wrote:

Hi @.*** https://github.com/ynren1020),

If I understand your question correctly, you want to convert the transcript position to the genome coordinate.

You can use the --genome flag with gtf and fasta provided through the --gtf_or_gff and --transcript_fasta, respectively, to output the genome coordinate of a site. Here is an example command from xpore's documentation https://xpore.readthedocs.io/en/latest/quickstart.html:

xpore dataprep \ --eventalign nanopolish/eventalign.txt \ --gtf_or_gff ../../demo.gtf \ --transcript_fasta ../../demo.fa \ --out_dir dataprep \ --genome

Thanks!

Best wishes, Yuk Kei

— Reply to this email directly, view it on GitHub https://github.com/GoekeLab/xpore/issues/144#issuecomment-1109451753, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGFIHCO3GUAHGQK7RQU3W63VG6MEZANCNFSM5UCN5XGA . You are receiving this because you were mentioned.Message ID: @.***>

yuukiiwa commented 2 years ago

Hi Yanan(@ynren1020),

No problem! Glad to help! Do you mind sharing the top 10 lines of your eventalign.txt file by head eventalign.txt, please? You can also check whether you used the same gencode.v38.transcripts.fa for upstream processing (e.g. alignment).

Thanks!

Best wishes, Yuk Kei

ynren1020 commented 2 years ago

Hi Yuk,

Thanks so much for your quick response. I attached the first few lines of eventalign.txt for your reference. By the way, I downloaded transcript fasta file (gencode.v38.transcripts.fa.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.transcripts.fa.gz) and gtf file (gencode.v38.annotation.gtf.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz) though this link ( http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/). I also reinstalled most updated xpore through github, the data.json is still empty too.

Thanks a lot for all your help! I appreciate it!

Yanan

On Wed, Apr 27, 2022 at 12:31 AM Yuk Kei Wan @.***> wrote:

Hi @.*** https://github.com/ynren1020),

No problem! Glad to help! Do you mind sharing the top 10 lines of your eventalign.txt file by head eventalign.txt, please? You can also check whether you used the same gencode.v38.transcripts.fa for upstream processing (e.g. alignment).

Thanks!

Best wishes, Yuk Kei

— Reply to this email directly, view it on GitHub https://github.com/GoekeLab/xpore/issues/144#issuecomment-1110560371, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGFIHCJVPN3WW2OAXTK4LVLVHDGKZANCNFSM5UCN5XGA . You are receiving this because you were mentioned.Message ID: @.***>

yuukiiwa commented 2 years ago

Hi Yanan(@ynren1020),

Do you mind pasting the several lines of your eventalign.txt file directly in the message, please? I don't see the attachment of the first few lines of your eventalign.txt.

Thanks!

Best wishes, Yuk Kei

ynren1020 commented 2 years ago

Hi Yuk,

Thanks so much for all your support! Please let me know what else you need from me.

Here are the first few lines: contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 12 GCCCA 26 t 2 61.12 1.059 0.00598 GCCCA 64.63 2.07 -1.43 31114 31132 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 13 CCCAA 26 t 3 72.74 1.304 0.00432 CCCAA 73.43 2.11 -0.27 31101 31114 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 14 CCAAC 26 t 4 85.45 0.738 0.00432 CCAAC 85.16 3.02 0.08 31088 31101 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 14 CCAAC 26 t 5 85.89 0.428 0.00232 CCAAC 85.16 3.02 0.2 31081 31088 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 14 CCAAC 26 t 6 84.2 1.3 0.00863 CCAAC 85.16 3.02 -0.27 31055 31081 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 14 CCAAC 26 t 7 80.7 0.737 0.00299 CCAAC 85.16 3.02 -1.25 31046 31055 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 14 CCAAC 26 t 8 83.66 1.168 0.00697 CCAAC 85.16 3.02 -0.42 31025 31046 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 14 CCAAC 26 t 9 84.59 1.412 0.02756 CCAAC 85.16 3.02 -0.16 30942 31025 ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 14 CCAAC 26 t 10 84.5 1.214 0.01129 CCAAC 85.16 3.02 -0.19 30908 30942

Best,

Yanan

On Thu, Apr 28, 2022 at 9:03 PM Yuk Kei Wan @.***> wrote:

Hi @.*** https://github.com/ynren1020),

Do you mind pasting the several lines of your eventalign.txt file directly in the message, please? I don't see the attachment of the first few lines of your eventalign.txt.

Thanks!

Best wishes, Yuk Kei

— Reply to this email directly, view it on GitHub https://github.com/GoekeLab/xpore/issues/144#issuecomment-1112813232, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGFIHCM3ZHTNUS3K6SLAGEDVHM7QJANCNFSM5UCN5XGA . You are receiving this because you were mentioned.Message ID: @.***>

yuukiiwa commented 2 years ago

Hi Yanan(@ynren1020),

The first column of your eventalign.txt file should be only the transcript_id (e.g. ENST00000457540.1), so you will have to remove the information after the transcript_id. Here is an example:

Your original eventalign.txt looks like the following:

ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene|
14 CCAAC 26 t 10 84.5 1.214 0.01129 CCAAC 85.16 3.02 -0.19 30908 30942

and you will have to change it to the following:

ENST00000457540.1 14 CCAAC 26 t 10 84.5 1.214 0.01129 CCAAC 85.16 3.02 -0.19 30908 30942

Thanks!

Best wishes, Yuk Kei

ynren1020 commented 2 years ago

Hi Yuk,

Thanks a lot for your help. I tried your suggestions and removed other information in the first column of the eventalign.txt file (only kept transcript id). However, it seems that the output of xpore --dataprep became empty again when I used all the arguments.

Here is the first few lines of new eventalign.txt file: contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx ENST00000414273.1 13 GTTGA 28 t 526 93.36 2.904 0.00631 GTTGA 88.48 2.65 1.45 12144 12163 ENST00000414273.1 13 GTTGA 28 t 527 88.97 0.977 0.00232 GTTGA 88.48 2.65 0.15 12137 12144 ENST00000414273.1 13 GTTGA 28 t 528 92.35 1.991 0.00266 GTTGA 88.48 2.65 1.15 12129 12137 ENST00000414273.1 14 TTGAC 28 t 529 97.67 2.338 0.00963 TTGAC 104.20 4.49 -1.15 12100 12129 ENST00000414273.1 15 TGACT 28 t 530 115.44 2.635 0.00631 TGACT 121.90 4.79 -1.07 12081 12100 ENST00000414273.1 15 TGACT 28 t 531 112.16 2.676 0.00531 TGACT 121.90 4.79 -1.60 12065 12081 ENST00000414273.1 15 TGACT 28 t 532 117.99 4.889 0.00598 TGACT 121.90 4.79 -0.64 12047 12065

Here is the command I have used: xpore dataprep \

--eventalign ${refer2}/C1_eventalign_reads_index2.update.cut2.tsv \ # first column only has transcript id, other columns are the same as before

--gtf_or_gff ../tombo/reference/gencode.v38.annotation.gtf \

--transcript_fasta ./reference/gencode.v38.transcripts.fa \

--out_dir ./C1/dataprep2 \

--genome

--n_processes 40

Here is the output screenshot: -rw-r--r-- 1 54M May 16 13:45 eventalign.index -rw-r--r-- 1 0 May 16 13:46 data.json # empty output -rw-r--r-- 1 14 May 16 13:46 data.index -rw-r--r-- 1 12 May 16 13:46 data.readcount -rw-r--r-- 1 45 May 16 13:46 data.log

Thanks a lot for your further suggestions, I appreciate all your generous help!

Yanan

On Tue, May 3, 2022 at 9:07 PM Yuk Kei Wan @.***> wrote:

Hi @.*** https://github.com/ynren1020),

The first column of your eventalign.txt file should be only the transcript_id (e.g. ENST00000457540.1), so you will have to remove the information after the transcript_id. Here is an example:

Your original eventalign.txt looks like the following:

ENST00000457540.1|ENSG00000225630.1|OTTHUMG00000002336.1|OTTHUMT00000006718.1|MTND2P28-201|MTND2P28|1044|unprocessed_pseudogene| 14 CCAAC 26 t 10 84.5 1.214 0.01129 CCAAC 85.16 3.02 -0.19 30908 30942

and you will have to change it to the following:

ENST00000457540.1 14 CCAAC 26 t 10 84.5 1.214 0.01129 CCAAC 85.16 3.02 -0.19 30908 30942

Thanks!

Best wishes, Yuk Kei

— Reply to this email directly, view it on GitHub https://github.com/GoekeLab/xpore/issues/144#issuecomment-1116864738, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGFIHCJARYITDUY4LPN4PVDVIHLU5ANCNFSM5UCN5XGA . You are receiving this because you were mentioned.Message ID: @.***>

ynren1020 commented 2 years ago

Hi Yuk,

I am wondering if some example code of preprocess data can be posted on the quick start guide web page. For example, instead of starting with xpore --dataprep as the first step. It would be better to have some code for preprocess the raw data to produce eventalign.txt. I know there is a separate web page for data preprocess, but that is a general version. How the example data in the xpore --dataprep produced can be very helpful. For example, adding guppy_basecaller and minimap, the format of gtf, fasta file be mentioned in detail.

Thanks!

Yanan

yuukiiwa commented 2 years ago

Hi @ynren1020,

Sorry for the delayed reply! (I just saw your comment)

You can find the preprocessing steps prior to running xpore dataprep here: https://xpore.readthedocs.io/en/latest/preparation.html

Thanks!

Best wishes, Yuk Kei