running into problem with Xpore after merging multiple .fastq and .bam files from same Nanopore run

annapopova-scripps commented 1 year ago

Dear Xpore developers,

Thank you for the software. It was easy to install and handle , until I decided to merge all 50+ .fastq (and .bam) output files from one nanopore run. Each .fastq roughly has 3-4K reads and on its own produces meaningful Xpore output table, when running in comparative mode. Then, I decided to merge five different .fastq (using cat) and .bam (using samtools merge), and no problem was met at the nanopolish index and nanopolish eventalign steps, consistently I observed that eventalign.txt was five times larger than for a single .fastq. However, Xpore dataprep and Xpore diffmod returned dataprep folder and diffmod.table as if a single .fastq was used.

I am currently limited to adding all individual .fastq files as replicates (below), which is stupid. So, I hope, you may have a good recommendation for the situation.

Anna

data: IVT: REP1: ./IVT_dataprep/IVT_1_dataprep REP2: ./IVT_dataprep/IVT_2_dataprep REP3: ./IVT_dataprep/IVT_3_dataprep REP4: ./IVT_dataprep/IVT_4_dataprep REP5: ./IVT_dataprep/IVT_5_dataprep REP6: ./IVT_dataprep/IVT_6_dataprep REP7: ./IVT_dataprep/IVT_7_dataprep REP8: ./IVT_dataprep/IVT_8_dataprep REP9: ./IVT_dataprep/IVT_9_dataprep REP10: ./IVT_dataprep/IVT_10_dataprep

yuukiiwa commented 1 year ago

Hi @annapopova-scripps,

Do you mind sharing the first 10 lines of the eventalign.txt file, please?

I suspect that some entries in the eventalign.txt file have the transcript/gene version while some do not. xpore strips off the gene/transcript version which may explain why it is the size of a single fastq/bam diffmod.table.

Thanks!

Best wishes, Yuk Kei

annapopova-scripps commented 1 year ago

Dear Yuk Kei,

Here are the top 10 lines of the eventalign.txt for the five .fastq files I merged.

Thank you for looking into the issue,

Anna

anna@supernova:~/ONT_files/ONT_runs/nanopolish/Xpore_1/70S$ head 70S_10to14_eventalign.txt contig position reference_kmer read_index strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level start_idx end_idx RRSB-RRNA 9 AGTTT 19 t 1 111.91 8.435 0.00199AGTTT 124.34 7.49 -1.46 65656 65662 RRSB-RRNA 9 AGTTT 19 t 2 128.19 5.800 0.01527AGTTT 124.34 7.49 0.45 65610 65656 RRSB-RRNA 10 GTTTG 19 t 3 79.84 1.955 0.01560GTTTG 78.97 1.97 0.39 65563 65610 RRSB-RRNA 11 TTTGA 19 t 4 89.31 2.765 0.01394TTTGA 89.94 2.65 -0.21 65521 65563 RRSB-RRNA 11 TTTGA 19 t 5 88.84 1.758 0.00764TTTGA 89.94 2.65 -0.37 65498 65521 RRSB-RRNA 11 TTTGA 19 t 6 88.55 1.466 0.00963TTTGA 89.94 2.65 -0.46 65469 65498 RRSB-RRNA 12 TTGAT 19 t 7 105.18 2.992 0.00564TTGAT 108.32 4.49 -0.62 65452 65469 RRSB-RRNA 12 TTGAT 19 t 8 103.04 1.315 0.00365TTGAT 108.32 4.49 -1.04 65441 65452 RRSB-RRNA 12 TTGAT 19 t 9 107.48 2.771 0.00232TTGAT 108.32 4.49 -0.16 65434 65441

yuukiiwa commented 1 year ago

Hi @annapopova-scripps,

Sorry for the delayed reply!

I supposed if your eventalign.txt is 5 times larger, then your dataprep is also 5 times larger. Still, when you are comparing the two conditions, the sites that differ are the same, which gives you a diffmod.table with a similar size to using a single .fastq file.

Thanks!

Best wishes, Yuk Kei

annapopova-scripps commented 1 year ago

Hi Yuk Kei,

My eventalign.txt is 5 times larger (5x), BUT dataprep is only 1x, and clearly in the diffmod.table the coverage never exceeds 1000. In fact, from a single .fastq I should get a coverage of ~1500, and from five .fastq files ~7500 per site.

One thing I noticed is that every data.readcount from a single .fastq is the same: idx,n_reads RRSB-RRNA,1001 RRLB-RRNA,1001 Which makes me suspicious that dataprep has a cut off of 1000 per gene. No change every fastq. has same number of reads :)
Do you think this is true?

Best, Anna

yuukiiwa commented 1 year ago

Hi Anna,

The --readcount_max [READCOUNT_MAX] is set to 1,000 reads by default. You can customize it by passing the --readcount_max [READCOUNT_MAX] when running xpore dataprep.

Thanks!

Best wishes, Yuk Kei

annapopova-scripps commented 1 year ago

Hi Yuk Kei,

Thank you for the feedback, that fixed the issue and things look consistent now :)

I want to bother you for few more moments:

1) I noticed that --n_processes option neither in xpore dataprep nor in xpore diffmod when specified does not engage more than one processor in Linux. Did you guys have a chance to evaluate the parallel version of xpore?

2) I have spent time to play with positions returned in the diffmod.table and got an idea how to use z_score and diff_mod_rate combination to filter out most modification and variant positions in the set of control samples. Still, I am worried that diffmod.table contains 10-30 more entries than var/mod positions, and that positions that are known to be unmodified have distinct mu_umodand mu_mod values, and significant individual mod_rate values (as high as 0.9). To test this, I took an in-vitro transcribed dataset, split it into two datasets and ran xpore in a comparative mode. There were hundreds of entries in the diffmod.table with apparently bimodal distribution of ion current detected by Xpore. This is quite alarming to me, and I am wondering if something can be done to adjust the reference model for that or addressed otherwise to make Xpore more sensitive to real modified positions and quantify stoichiometry.

Sincerely,

Anna

yuukiiwa commented 1 year ago

Hi Anna,

Because of io, --n_processes uses at most 2 cpus in xpore dataprep. Still, parallel processing should work well in xpore diffmod from my experience.
Thanks for raising this! I will discuss with @ploy-np on this.

Thanks!

Best wishes, Yuk Kei

GoekeLab / xpore

running into problem with Xpore after merging multiple .fastq and .bam files from same Nanopore run #179