m6a detection in nanopore sequencing data

kwonej0617 commented 1 year ago

Hi, Developer.

Thank you for developing a useful tool. I wanted to identify m6a sites from nanopore sequencing data. Before analyzing my data, I tried going through the steps in the document https://dieterich-lab.github.io/JACUSA2helper/articles/web_only/JACUSA2helper-nanopore.html.

I was able to perform 'General workflow' parts successfully without errors, but I had an issue in running 'Use NMF on Nanopore data' part. While trying to figure out the error, I checked your paper and code and found that you used the consensus m6a sites from Boulias, Koertel, and * Koh as training data.

I wonder if I have to download the miCLIP data to train the model and run the R code in 'Use NMF on Nanopore data' part before identifying m6a sites. Otherwise, do you provide a pre-trained model in JACUSA2 software?

I am looking forward to hearing from you.

Thank you!

piechottam commented 1 year ago

Dear kwonej0617, sry for the late response.

We have updated JACUSA2helper (the last release before 2.0.0)

Please checkout: https://dieterich-lab.github.io/JACUSA2helper/articles/web_only/JACUSA2helper-nanopore.html

We have added:

data(m6a_miclip)
data(m6a_nmf_results)

kwonej0617 commented 1 year ago

I really appreciate your update!

I was going to try running the updated script, but I have a question regarding input files. In the new script, it loaded two input files , WT_vs_KO_2samp_RC22_call2_result.out.gz and WT100_vs_WT0_RC22_call2_result.out.gz.

However, the files in this link (https://zenodo.org/record/5930729#.Y48Wz3ZBxPb) have different names which was used in the previous script, WT_vs_KO_call2_result.out.gz (JACUSA2 output), WT_vs_realIVT_v202_call2_result.out.gz (JACUSA2 output).

I just wonder if I can just change the file names in the new script with those two file names above or it requires new files (WT_vs_KO_2samp_RC22_call2_result.out.gz and WT100_vs_WT0_RC22_call2_result.out.gz) to run the new script.

Thank you for your help!

piechottam commented 1 year ago

The files are the same. You can change the names. I'll fix the names and update the vignette.

Thank you for your feedback!

kwonej0617 commented 1 year ago

Hi, I tried running the updated script. However, I got an error. Do you know what the problem is? Thank you for your help!

piechottam commented 1 year ago

Sry, you need to update the JACUSA2helper package. I'll add a note saying what the minimal version of JACUSA2helper is.

kwonej0617 commented 1 year ago

Thanks! It looks like the script is running well!

Actually, I wanted to make an output file including m6a modification sites as you showed in Additional file 4, 13059_2022_2676_MOESM4_ESM.xlsx from supplementary materials. Since I am quite unexperienced in R, I am not pretty sure in which step in your document (https://dieterich-lab.github.io/JACUSA2helper/articles/web_only/JACUSA2helper-nanopore.html#evaluation) I could make the result like that. Could you please help me with that?

Thank you so much!

piechottam commented 1 year ago

Sry for the late reply - was "busy" with corona...

You can use the following package library("writexl") to export your data frame to excel sheet.

Given the score_matrix from https://dieterich-lab.github.io/JACUSA2helper/articles/web_only/JACUSA2helper-nanopore.html#evaluation:

# get the coordinates of the sites
df <- as.data.frame(GRanges(rownames(score_matrix)))
rownames(df) <- NULL
df$type <- "Nanopore"
df <- df[, c("seqnames", "start", "end", "type", "strand")]
colnames(df)[1] <- "chr"

# extract the score etc.
df <- cbind(df, as.data.frame(score_matrix[, c("score", "motif")]))
df$DRACH <- "no"
df$DRACH[grep("[AGT][AG]AC[ACT]", df$motif)] <- "yes"

# save
library("writexl")
write_xlsx(df, "sites.xlsx")

Hope that helps.

kwonej0617 commented 1 year ago

Thank you for your help!!!

kwonej0617 commented 1 year ago

Hi, Your script to write the output file was really helpful. But the very last code, write_xlsx(df, "sites.xlsx") threw an error saying Error: Error in libxlsxwriter: 'Worksheet row or column index out of range.’ So I generated output in text format. The reason why I got the error will be that excel can only 1,048,575 rows in a sheet, but the number of rows in df has more rows than that limit.

dim(df) [1] 2286021 8

Also, the Nanopore analysis document required two files, WT_vs_KO_call2_result.out.gz and WT100_vs_WT0_call2_result.out.gz for the analysis. Because each file was generated by comparing the error profile between WT and KO, or WT100 and WT0, I assumed only either of jacusa2 output files can be utilized to identify m6a sites. I wondered why it required two files in the analysis.

Also, I wanted to learn which minimap2 options you used to get your bam file. I converted bam to sam format you uploaded here: https://zenodo.org/record/5940218#.Y7oRsnbMJP, then found the following command line.

_@PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -t 5 --MD -ax splice --junc-bonus 1 -k14 --secondary=no --junc-bed /home/cdieterich/index/final_annotation_96.bed -ub /home/cdieterich/index/GRCh38_90_Niels_Gehring/minimap2_2.15/GRCh38_90_NielsGehring.mmi HEK293T-KO-rep2 .fastq.gz

I just wondered how you get the file, final_annotation_96.bed, and GRCh38_90_Niels_Gehring.mmi. Could you please share the information on those files and how you get them?

Lastly, could you share the command line you used to run guppy? I wanted to run guppy with the same setting you did

I really appreciate your help! Happy new year.

piechottam commented 1 year ago

This is a bug on our side - apparently we missed the warning (max fields per sheet) when we generated the XLS - THANK YOU FOR YOUR FEEDBACK!

Please find the annotation and index in our cloud (https://data.dieterichlab.org/s/RSmk29jxkqfNTXw)
Additionally, here is the corrected "score_matrix.rds" which corresponds to the object before it is converted to XLS
Guppy was run with: guppy_basecaller --verbose_logs \ --compress_fastq \ --fast5_out \ -r \ -i ${1} \ -s ${2} \ --flowcell FLO-MIN106 \ --kit SQK-RNA002 \ --num_callers 4 \ --min_qscore 7 \ --gpu_runners_per_device 4 \ --cpu_threads_per_caller 1 \ --device cuda:0

Thank you for your feedback!

kwonej0617 commented 1 year ago

Thank you so much for your response! So, based on what you said, because you ignored the warning message in generating excel, the total number of m6A sites I got (2286021) would be correct, right?

Also, I want to get the same result as you did in additional file 4. Based on your command lines in the supplementary file, I guess you generated two JACUSA2 output files, WT_vs_KO_call2_result.out.gz, and WT100_vs_WT0_call2_result.out.gz used in JACUSA2helper example (https://dieterich-lab.github.io/JACUSA2helper/articles/web_only/JACUSA2helper-nanopore.html) with bam files that were processed with xPore fast5 dataset as follows:

WT_vs_KO_call2_result.out.gz: HEK293T-WT-rep2, HEK293T-WT-rep3, HEK293T-KO-rep2, HEK293T-KO-rep3
WT100_vs_WT0_call2_result.out.gz - HEK293T-WT-100-rep1, HEK293T-WT-100-rep2, HEK293T-WT-100-rep3, HEK293T-WT-0-rep1, HEK293T-WT-0-rep2

Could you confirm that this is correct? If so, it looks like it requires lots of fast5 datasets. By chance, do you have m6A output data from JACUSA2 and JACUSA2helper pipeline that use the minimum number of fast5 datasets?

I really appreciate your help!

piechottam commented 1 year ago

So, based on what you said, because you ignored the warning message in generating excel, the total number of m6A sites I got >(2286021) would be correct, right?

Those are predicted/candidate sites - the XLS is sorted by NMF score. m6A sites are expected to be at the top of the file. A possible threshold is presented in the "Evaluate" subsection of the vignette.

Could you confirm that this is correct? If so, it looks like it requires lots of fast5 datasets. By chance, do you have m6A output >data from JACUSA2 and JACUSA2helper pipeline that use the minimum number of fast5 datasets? You should be able to use the BAM files to get the same results. I don't understand the problem? Is there a problem with the provided BAM files? Of course you could start with the FAST5 files and use the Guppy command I posted in a previous message.

Thank you for your feedback.

kwonej0617 commented 1 year ago

Got it. Thank you for your answer!

kwonej0617 commented 1 year ago

By the way, could you let me know which version of guppy did you use in base-calling? Also, could you provide me the kit # and flowcell for samples as follows used in base-calling? HEK293T-WT-0-rep1 HEK293T-WT-0-rep2 HEK293T-WT-100-rep1 HEK293T-WT-100-rep2 HEK293T-WT-100-rep3

Thank you!

piechottam commented 1 year ago

guppy/5.0.11,
FLO-MIN106, and
SQK-RNA002

kwonej0617 commented 1 year ago

Hi @piechottam I successfully reproduced your work using the data and code provided. Recently, I have run JACUSA2 using public data and tried to run JACUSA2helper to detect m6A sites.

Here is my output of JACUSA2.

## JACUSA2 Version: 2.0.0-RC22 call-2 -m 1 -q 1 -c 4 -p 40 -D -I -a D,Y -P1 FR-SECONDSTRAND -P2 FR-SECONDSTRAND -r WT_KO_call2.out /home/euijin.kwon-umw/Euijin/m6a_tool_comparision/data/minimap2/HEK293T-rep1/wt.bam /home/euijin.kwon-umw/Euijin/m6a_tool_comparision/data/minimap2/HEK293T-rep1/ko.bam
#contig start   end     name    score   strand  bases11 bases21 info    filter  ref
ENST00000361390 5       6       call-2  0.05854015216027619     +       0,129,0,1       0,227,0,1       del11=0,130;del21=3,231;deletion_pvalue=0.4288244599213218;deletion_score=0.6260051196441054;ins11=0,130;ins21=1,231;insertion_pvalue=0.9334599590242678;insertion_score=0.006970993708819151   *       C
ENST00000361390 6       7       call-2  0.24858149843021238     +       123,0,0,0       215,2,0,1       del11=7,130;del21=13,231;deletion_pvalue=1.0;deletion_score=-0.0044531128369271755;ins11=0,130;ins21=2,231;insertion_pvalue=0.6335810689716541;insertion_score=0.22723578684963286      *       A
ENST00000361390 7       8       call-2  0.06676272628828883     +       0,0,0,128       0,1,0,234       del11=4,132;del21=3,238;deletion_pvalue=0.22346263534874455;deletion_score=1.4819950729142874;ins11=2,132;ins21=5,238;insertion_pvalue=0.8832268770801001;insertion_score=0.02157368278130889   *       T
ENST00000361390 8       9       call-2  0.16147901761723915     +       3,0,131,0       7,0,229,1       del11=2,136;del21=5,242;deletion_pvalue=0.8676207277537094;deletion_score=0.027782694902271032;ins11=3,136;ins21=1,242;insertion_pvalue=0.12832534346104396;insertion_score=2.312647156231833   *       G

Since I used transcript reference in running JACUSA2, the generalized code in https://dieterich-lab.github.io/JACUSA2helper/articles/web_only/JACUSA2helper-nanopore.html doesn't work in several parts such as filter(seqnames [%in%] c ([as.character](1:22), "MT", "X")) and [seqlevels](result, pruning.mode = "coarse") <- c(1:22, "MT", "X") . I am assuming those parts are to filter out the JACUSA2 output which belongs to chromosomes 1-22, MT, and X. Since my data have transcription levels, I would like to change the code and take all JAUCSA2 output in downstream analysis. Could you provide me with a code that works for the transcripts level?

Thank you so much!

piechottam commented 1 year ago

You should be able to skip/remove the chr-specific code parts. The code parts were necessary, to filter/remove contigs (and retain only chromosomes). Lateron in the analysis, when predictions are compared against miCLIPdata GenomicRanges? would complain about different seqlevels. Since you work on transcript level (and you are not planning to compare against miCLIP data) - it is safe to remove the code parts.

dieterich-lab / JACUSA2helper

m6a detection in nanopore sequencing data #20