MGPC-Nantes / MEM

Motivation: Microsatellites are DNA sequences formed by a continuous repetition of patterns from 1 to 6 nucleotides. Deficient mismatch repair system (dMMR) induces a variation in length of microsatellites called microsatellite instability (MSI). However, MSI assessment by Next-Generation Sequencing (NGS) is difficult because replication errors occurring during the amplification steps of the sequencing process themselves induce variation in microsatellite sequence length. Results: The MSI assessment by Expectation-Maximization (MEM) algorithm attempts to closely replicate the reference PCR interpretation method for 5 microsatellites validated by the Bethesda and ESMO international guidelines (BAT-25, BAT-26, NR-21, NR-24, and NR-27). MEM identifies the stable or unstable nature of each microsatellite i- by determining the length distribution of microsatellite sequences from unmapped and quality unfiltered paired-end reads using the Smith-Waterman alignment, and ii- by determining whether the observed distribution is comparable to a reference distribution (stable) or corresponds to a mixture model, i.e. the mixture of several sub-distributions of different mean lengths (unstable) using Expectation-Maximization algorithm.

There was a problem attempting to run MEM #2

Open observer2735 opened 11 months ago

observer2735 commented 11 months ago

Hello,

I am trying to run my MSI dataset using MEM. I have downloaded MEM, but I don't know how to use it. After searching the whole repository, I'm not sure where the main function is, and there are no instructions in the README. If I want to run my MSI dataset by building MEM from the source code, what should I do? What parameters and commands should I enter?

Thank you for your reply

MGPC-Nantes commented 11 months ago

Hello,

MEM is a plugin for QIAGEN's CLC Genomics Workbench software. Do you have this software?

Best regards

Programjackanapes commented 11 months ago

Hello,

I don't have this software, and I'm trying to reproduce this work by recompiling the source code. After reading the source code, I found that it may be a utility class belonging to other software. Could you please tell me how to use QIAGEN's CLC Genomics Workbench (I don't know what it is)? Alternatively, I saw that the method for calling MSI status in the article goes from FASTQ to feature; can you give me some commands to reproduce your work?

Sorry to bother you, wishing you a happy life!

MGPC-Nantes commented 11 months ago

CLC Genomics Workbench is a software package for sequence analysis. See: https://digitalinsights.qiagen.com/products-overview/discovery-insights-portfolio/analysis-and-visualization/qiagen-clc-genomics-workbench/

The first part of the analysis consists of determining the size distribution of the microsatellite sequences, directly from the unaligned reads. This is done using built-in functions in the CLC Genomics Workbench software, included in a workflow attached to the Github repository.

The second part, carried out by the plugin itself, involves the interpretation of the distributions obtained, using an expectation-maximisation algorithm.

We are currently developing a Python version combining these two parts.

Programjackanapes commented 11 months ago

Hello,

Thank you very much for your reply. I immediately understood how to use the MEM plugin. Your answer was very, very professional!

Now I'm worried about another thing: according to your supplementary materials, I need the trimmed PCR sequences to reproduce your work. This is a thorny issue. First, I don't know whether different datasets and different PCR protocols share the same trimmed sequences at the 3' and 5' ends (I'm not clear about the PCR process). Second, although there are PCR sites in my NGS sequence data, I don't have PCR output, so I cannot compare PCR verification results with the results calculated by MEM.

This may be a very unreasonable request, but I noticed in your article that “The data presented in this study are available on request from the corresponding author”, so I would like to ask whether I can request the NGS sequencing data and the corresponding PCR data used in your article to verify your MEM results.

I assure you that I am requesting the data solely for academic purposes and will not use them in any commercial activity. If I need to use your data in the future, I will ask for your consent first.

Thank you very much for your timely reply! Wishing you a pleasant work and smooth life!

observer2735 commented 11 months ago

Hello,

I have another problem with MEM, not about the dataset request :).

In the paper, I noticed a scaling function named S(x,m), and I found it in the source code as:

shiftedX = (double) x * theoSize / size;

and the parameter shiftedX is then used as follows:

(shiftedX - (double) referenceX)

I don't understand why shiftedX needs to be calculated. The paper only calls it a "scaling function", but I don't understand why it is needed or how it is used. I partly understand the difference between shiftedX and referenceX: perhaps referenceX is the expected value and shiftedX is the true value?

Looking forward to your reply!

MGPC-Nantes commented 11 months ago

Hello,

When a microsatellite is unstable, the size distribution of its sequences follows a "mixture model", i.e. we observe a distribution made up of several sub-distributions, one for each allele of different size, each contributing to the mixture in a proportion equal to its allelic fraction.
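Written out as a formula, the mixture described above could look like the following. The notation is chosen here for illustration and is not necessarily the one used in the paper: K is the number of alleles, \pi_k the allelic fraction of allele k, and f_k the length distribution of its sequences.

```latex
f(x) = \sum_{k=1}^{K} \pi_k \, f_k(x),
\qquad \pi_k \ge 0,
\qquad \sum_{k=1}^{K} \pi_k = 1
```

A stable microsatellite is then the special case K = 1, where f reduces to a single unimodal distribution.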

MEM uses an expectation-maximization algorithm to build this mixture model. We considered that each sub-distribution had the same "shape" as the unimodal distribution observed for the stable microsatellite, and that only its scale changed. For example, if a microsatellite has an unstable allele 10 bp shorter than the reference size, then by scaling the reference distribution by -10 bp, we obtain a distribution whose mean is 10 bp shorter than the reference size, and with less dispersion (variance), which is fairly consistent with what is observed in practice. This scaling function enables us to parameterize each sub-distribution using just two parameters: the allele size (which determines the position of the sub-distribution, and its dispersion through scaling) and its allelic fraction (which determines the proportion of the sub-distribution in the final mixture model).
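As a minimal sketch of this scaling idea (not the plugin's actual code), the Java below evaluates a reference density at the rescaled position `x * theoSize / size`, which mirrors the `shiftedX` line quoted earlier in this thread. Everything else is an assumption made for illustration: the class and method names, the Gaussian standing in for the reference distribution, the reference length of 25 bp, and the normalising factor `theoSize / size` (whether MEM applies such a factor is not visible from the quoted lines).

```java
import java.util.function.DoubleUnaryOperator;

public class ScalingSketch {

    // Map an observed length x onto the reference axis, mirroring the quoted line
    // `shiftedX = (double) x * theoSize / size;`.
    public static double shiftedX(double x, double theoSize, double size) {
        return x * theoSize / size;
    }

    // Density at x of the sub-distribution for an allele of length `size`,
    // taken as the reference density stretched by size / theoSize.
    // The leading factor keeps the area equal to 1 (an assumption of this sketch).
    public static double scaledDensity(DoubleUnaryOperator reference,
                                       double x, double theoSize, double size) {
        return (theoSize / size) * reference.applyAsDouble(shiftedX(x, theoSize, size));
    }

    // Hypothetical stand-in for a smoothed reference distribution.
    public static DoubleUnaryOperator gaussian(double mean, double sd) {
        return x -> Math.exp(-0.5 * Math.pow((x - mean) / sd, 2.0))
                / (sd * Math.sqrt(2.0 * Math.PI));
    }

    // Riemann-sum estimate of {total mass, mean, standard deviation} on [lo, hi].
    public static double[] moments(DoubleUnaryOperator density,
                                   double lo, double hi, int steps) {
        double dx = (hi - lo) / steps;
        double mass = 0.0, first = 0.0, second = 0.0;
        for (int i = 0; i <= steps; i++) {
            double x = lo + i * dx;
            double w = density.applyAsDouble(x) * dx;
            mass += w;
            first += x * w;
            second += x * x * w;
        }
        double mean = first / mass;
        double variance = second / mass - mean * mean;
        return new double[] {mass, mean, Math.sqrt(Math.max(variance, 0.0))};
    }

    public static void main(String[] args) {
        double theoSize = 25.0; // hypothetical reference length in bp
        double size = 15.0;     // unstable allele, 10 bp shorter
        DoubleUnaryOperator ref = gaussian(theoSize, 1.0);
        DoubleUnaryOperator sub = x -> scaledDensity(ref, x, theoSize, size);
        double[] m = moments(sub, 0.0, 50.0, 50000);
        System.out.printf("mass=%.3f mean=%.3f sd=%.3f%n", m[0], m[1], m[2]);
    }
}
```

With these illustrative numbers, the scaled sub-distribution is centred on 15 bp and its standard deviation shrinks from 1.0 to 0.6 (the ratio 15/25), so a single parameter, the allele size, sets both the position and the dispersion.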

An intermediate step remains: to determine the reference distribution of each microsatellite, we used the distributions observed on several known MSS samples. However, this method only gives a discrete distribution, whereas the expectation-maximization algorithm constructs the mixture models from continuous sub-distributions. We must therefore transform the discrete reference distributions obtained on MSS samples into continuous distributions using a kernel estimation function before using them.
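The smoothing step can be sketched as a weighted kernel density estimate. The Java below is an illustration under stated assumptions, not the plugin's implementation: the Gaussian kernel, the bandwidth of 0.8 bp, the class and method names, and the example histogram are all hypothetical. The difference `x - lengths[i]` plays the same role as the `(shiftedX - (double) referenceX)` term quoted earlier in this thread, although that correspondence is a reading of the snippet rather than something the source confirms.

```java
public class KernelSketch {

    // Gaussian kernel density estimate at position x.
    // lengths[i] is an observed sequence length in bp and weights[i] is the
    // fraction of reads with that length (the weights are expected to sum to 1).
    public static double kde(double x, int[] lengths, double[] weights, double bandwidth) {
        double sum = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            double u = (x - (double) lengths[i]) / bandwidth;
            sum += weights[i] * Math.exp(-0.5 * u * u);
        }
        return sum / (bandwidth * Math.sqrt(2.0 * Math.PI));
    }

    public static void main(String[] args) {
        int[] lengths = {23, 24, 25, 26};        // hypothetical MSS length histogram
        double[] weights = {0.1, 0.3, 0.5, 0.1}; // hypothetical read fractions
        double bandwidth = 0.8;                  // hypothetical kernel width in bp
        // Evaluate the continuous density on a half-bp grid from 22 to 27 bp.
        for (int i = 0; i <= 10; i++) {
            double x = 22.0 + 0.5 * i;
            System.out.printf("%.1f %.4f%n", x, kde(x, lengths, weights, bandwidth));
        }
    }
}
```

Because the result is defined for any real-valued x, it can be evaluated at rescaled, non-integer positions, which a discrete histogram cannot provide.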

I hope that these explanations will be able to answer your questions, do not hesitate to let me know if it is not clear.

Regarding the provision of our datasets, it is difficult to provide you with all the FASTQs used in the study, both in terms of feasibility (the data are very large) and in terms of regulations (I would need an authorization from the French authorities for digital confidentiality, because it concerns patient data). Alternatively, if you just want to compare our methods, I can offer you either:

Best regards