MSG can't process paired-end sequences

Hi MSG team,

I am currently working with paired-end sequencing data. Because my fastq files were already deconvoluted, I skipped the first step of preprocessing/deconvoluting fastq files. I only reformated my fastq files (R1 and R2) to have single fastq files with interleaved reads for each sample. The files were renamed to follow MSG naming rule (indiv_[sample_name]_barcode) and an empty read.fq file was created. While the alignment step works without problem, the pipeline chokes when removing non common reads. The problem comes from the fact that each pair-end sequence has the same ID. When the reads are processed by extract-ref-alleles.py, the SAM file is first analyzed for the presence of multiple occurrences of the same ID as a proxy of reads aligning to multiple locations. This is done by listing the read ID of each SAM entry and by discarding any entries that appear twice. The problem is that for paired-end reads, the read ID appears at least twice. So all reads are therefore discarded.

As a quick and dirty patch, I edit two lines of the remove_non_matching_reads function of the extract-ref-alleles.py script: https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/extract-ref-alleles.py#L128 https://github.com/JaneliaSciComp/msg/blob/e6fedcb88f51d7f7a876f7f2c62f18b96dd06e6b/extract-ref-alleles.py#L133

A simple change from 1 to 2 solved it. However, this is obviously not compatible with the initial intent when working with single-end sequencing data. To be more efficient, an evaluation should probably be done depending on single-end reads (e.g., sim_reads[read.query_name] == 1) or paired-end reads (e.g., sim_reads[read.query_name] <= 2). The use of <= in the second case will be needed because if reads are preprocessed/deconvoluted as specified in the toy example, this can lead to discarding one read of the pair in some cases and therefore the read ID would appear only once.

In addition of modifying scripts, this would also require identifying use of paired-end reads. Three possible solutions:

Having a parameter in the msg.cfg file to specify if paired-end sequencing is used.
Detect the presence of paired-end reads from the fastq files and store the information in a file (this way, no need for user input). In the case of interleaved file, this is very easy as the sequence ID will be of the form: @A00740:72:HWFVNDSXX:3 1:XX... @A00740:72:HWFVNDSXX:3 2:XX... However, this will not work if the user only concatenate the two reads in one file without interleaving. This would require analyzing the occurrence of each ID (or a subset of them).
Detect the presence of pair-end reads from the SAM file by listing the minimal occurrence of read ID: if a minimum of 1 is detected, this would mean that single-end reads are used, if a minimum of 2, this would mean paired-end reads. This would avoid storing the information in the file as mentioned above. However, I don't know how computationally efficient this approach would be. This could be done on a subset of ID. In addition, if reads are preprocessed/deconvoluted and some paired-end reads are turned single-end reads, this could lead to false single-end read detection.

In my opinion, paired-end sequencing is likely to become the norm if not already. It is relevant to adapt the pipeline for processing such reads.

If there is any future solution you would like me to test, please let me know.

JaneliaSciComp / msg

MSG can't process paired-end sequences #46