BiodataAnalysisGroup / UMIc

A framework implementing a method for UMI deduplication and reads correction.
MIT License

UMI in the read header but not in the sequence #7

Closed: goyastephanie closed this issue 4 days ago

goyastephanie commented 1 year ago

Hi developers, I have UMI data run as paired-end 2x150, with an 11 nt UMI attached to the 3' end of the i7 index. I want to use UMIc, but I have no way of getting the UMI attached to the start of the read. After basecalling with BCL Convert, the UMI is in the read header, e.g.: @M04xxx:2x6:000000000-L5xxB:1:1101:15xxx:xx50:ATGTGTTGAGT 1:N:0:CAGGTTCA+ATAACGCC

Is there any way to run UMIc so that it identifies the UMI from the read header? If this is not possible, do you know of any way to attach the UMI from the read header to the start of the sequence? Thank you!! Stephanie

npechl commented 1 year ago

Hi @goyastephanie,

Thank you for your comment! To gain a clearer understanding of your question, could you kindly provide more details about the orientation of the reads? In particular, does your data follow the UMIc assumption of UMIs being positioned at the start of each sequence as well as having UMI-tagged libraries (direct stranded sequences, link)?

To address your question: UMIc currently supports three usage scenarios:

  1. paired-end libraries with UMIs in R1,
  2. paired-end libraries with UMIs in both R1 and R2, and
  3. single-end libraries.

Consequently, it may not be suitable for your particular scenario. A potential solution is the one you suggest: prepending the UMIs from the read header to the start of each sequence. This can be done in R using the ShortRead package. The procedure involves extracting the UMIs from the sequence headers, adding them to the sequences, and then writing the modified fastq files. Note, however, that this approach may become slow or memory-hungry if your files are large.

Kind regards, Nikos

goyastephanie commented 1 year ago

Thanks for answering, @npechl! The UMI is at the end of the i7 index (at the 3' end of the target). Thus the UMI is read during the index cycles and not during the target sequence cycles. That's why I couldn't get the UMI sequence at the start of the read with any basecaller. I can only get the UMI sequence into the read header during basecalling. I can try the ShortRead package, but which specific function do you suggest? I couldn't find it in the reference manual. Kind regards, Stephanie

npechl commented 1 year ago

Using the ShortRead package involves some manual programming to place the UMI sequences at the start of the reads. First, use the readFastq function to read the fastq files; this creates one object per fastq file, containing both the sequences and the headers. Then use string manipulation, for example with the stringr package or an alternative, to extract the UMI sequences from the headers and concatenate them to the beginning of the sequences. Finally, write the modified sequences to new files with the writeFastq function.
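
The read, extract, concatenate, write steps described here can be sketched as follows. This is a minimal illustration in plain Python (standard library only), swapped in for the R/ShortRead route so that it is self-contained; it is not code from UMIc or ShortRead. It assumes the header layout from the example earlier in the thread, where the UMI is the last colon-separated field of the read name, and it assumes plain 4-line fastq records. The names prepend_umi and rewrite_fastq are hypothetical. One detail the steps do not mention: a fastq record needs a quality string as long as its sequence, and the header carries no qualities for the UMI bases, so the sketch pads with a placeholder character ("I", Phred 40 in Phred+33 encoding). That placeholder is an assumption, not a measured quality.

```python
import re

# Assumption: the UMI is the last colon-separated field of the read name,
# as in the example header quoted in this thread.
UMI_AT_END = re.compile(r":([ACGTN]+)$")


def prepend_umi(header, seq, qual, pad_quality="I"):
    """Return (sequence, quality) with the header UMI placed at the start.

    pad_quality is a placeholder quality character for the UMI bases,
    because the header does not carry their base qualities.
    """
    read_name = header.split()[0]  # drop the "1:N:0:..." comment part
    match = UMI_AT_END.search(read_name)
    if match is None:
        raise ValueError(f"no UMI found in header: {header!r}")
    umi = match.group(1)
    return umi + seq, pad_quality * len(umi) + qual


def rewrite_fastq(src, dst):
    """Copy 4-line fastq records from src to dst, prepending each UMI.

    src and dst are open text file objects; headers are left unchanged.
    """
    while True:
        header = src.readline().rstrip("\n")
        if not header:  # end of file
            break
        seq = src.readline().rstrip("\n")
        plus = src.readline().rstrip("\n")
        qual = src.readline().rstrip("\n")
        new_seq, new_qual = prepend_umi(header, seq, qual)
        dst.write(f"{header}\n{new_seq}\n{plus}\n{new_qual}\n")
```

For uncompressed files this could be called as rewrite_fastq(open("in.fastq"), open("out.fastq", "w")), with both file names being placeholders. Because it streams one record at a time, memory use does not grow with file size. Which mate files to rewrite depends on which of the UMIc scenarios listed earlier in the thread is being targeted.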

I should note that I am still a little confused about the positioning of the UMIs. To be clear about what UMIc does: it reads the fastq file and takes the first xx bases (a user-defined length) from the beginning of each sequence as the UMI.
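
As a rough illustration of that behaviour (not UMIc's actual code), taking a user-defined number of leading bases as the UMI amounts to a simple slice. The function name split_umi is hypothetical, and the length of 11 in the usage note comes from the 11 nt UMI mentioned at the top of the thread.

```python
def split_umi(sequence, umi_length):
    """Split a read into (umi, remainder) at a user-defined UMI length."""
    if umi_length < 0 or umi_length > len(sequence):
        raise ValueError("umi_length must be between 0 and the read length")
    return sequence[:umi_length], sequence[umi_length:]
```

With a read whose first 11 bases are the UMI, split_umi(read, 11) returns the UMI and the remaining target sequence, which is why a UMI moved from the header to the start of the sequence would be picked up.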

With these explanations, I hope the process is now clearer for you!

Regards, Nikos