524D / compareMS2

Compare samples by MS2 spectra
MIT License

Unexpected behavior (partial trees/labels appear incorrect) #6

Closed magnuspalmblad closed 3 years ago

magnuspalmblad commented 3 years ago

While analyzing a set of 72 LC-MS/MS datasets of primate sera, with E. coli as an "outgroup", I noticed that some E. coli samples were grouped with primate sera in the first trees of the analysis. The first trees may differ from the final one, but E. coli is so different that it should never group with the sera. Is it possible that there is some shift or error in the matching of the sample names/labels to the tree nodes? The tree does converge to a more expected result, but this could theoretically happen even if there is a bug (the effect of which would be diluted as more correctly aligned comparisons and labels are added).

magnuspalmblad commented 3 years ago

See this screenshot. Sample S01539 is a chimpanzee serum, and should not be grouped with S01540, which is E. coli. It looks like there is often one sample/label that is misplaced (until the very end). Does that help identify the bug?

magnuspalmblad commented 3 years ago

In the final tree displayed, S01539 is missing, though both datasets are in the MEGA file.

524D commented 3 years ago

The reason for this is that the compareMS2_to_distance_matrix executable does not handle the situation where the "sample-to-species" file contains more entries than have been processed. In that case, the list at the start of the MEGA file seems to contain all species/filenames. Probably related: under the same conditions, compareMS2_to_distance_matrix sometimes produces output in which the distance matrix contains empty lines, causing the GUI to draw an incomplete tree. This can be solved either by always giving compareMS2_to_distance_matrix a "sample-to-species" file that exactly matches the available results, or by making compareMS2_to_distance_matrix more tolerant. The first solution seems like a kludge, because it requires the GUI to parse/process/filter the "sample-to-species" file, which it otherwise has no business with. Therefore, I will try to make the change in compareMS2_to_distance_matrix.
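For illustration, the first option (trimming the "sample-to-species" file down to the samples that have actually been processed) could look roughly like the sketch below. It is not code from compareMS2. It assumes the file holds one tab-separated `sample<TAB>species` entry per line, the function name `filter_sample_to_species` is hypothetical, and the example entries are made up.

```python
def filter_sample_to_species(mapping_lines, processed_samples):
    """Keep only mapping entries whose sample name has been processed.

    Assumes each mapping line is "sample<TAB>species" (an assumption about
    the file layout, not something confirmed by compareMS2 itself).
    """
    processed = set(processed_samples)
    kept = []
    for line in mapping_lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # drop blank lines so they cannot end up in the output
        sample = line.split("\t", 1)[0].strip()
        if sample in processed:
            kept.append(line)
    return kept


# Made-up example: three mapped samples, only two processed so far.
mapping = [
    "S01539\tchimpanzee serum\n",
    "S01540\tE. coli\n",
    "S01541\tplaceholder species\n",
]
processed = ["S01539", "S01540"]
filtered = filter_sample_to_species(mapping, processed)
print("\n".join(filtered))
```

The filtered lines would then be written to a temporary file and passed on in place of the full mapping, which is exactly the extra work in the GUI that makes this option feel like a kludge.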

524D commented 3 years ago

This should work correctly with version 0.0.3. Two E. coli samples are still grouped with the sera (filenames ending in _1152 and _1153). Inspecting the output of the compareMS2 command-line executable shows that fraction_gt_cutoff is indeed larger between those files and the nearest serum file (filename ending in _0134) than between those files and the other E. coli files. Maybe those files contain contaminants.
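A check like the one described above can be scripted once the pairwise values have been collected. The sketch below is only an illustration: it assumes the fraction_gt_cutoff values have already been read into a dictionary keyed by sample pairs, and that a larger value means the two samples are more similar. The function names, sample names, and numbers are all hypothetical.

```python
def nearest_neighbours(similarity):
    """Map each sample to its most similar other sample and that value."""
    best = {}
    for (a, b), value in similarity.items():
        # Each pair counts in both directions.
        for sample, other in ((a, b), (b, a)):
            if sample not in best or value > best[sample][1]:
                best[sample] = (other, value)
    return best


def cross_group_neighbours(similarity, group_of):
    """List samples whose nearest neighbour belongs to a different group."""
    flagged = []
    for sample, (other, value) in sorted(nearest_neighbours(similarity).items()):
        if group_of[sample] != group_of[other]:
            flagged.append((sample, other, value))
    return flagged


# Made-up values, for illustration only.
similarity = {
    ("ecoli_A", "ecoli_B"): 0.10,
    ("ecoli_A", "serum_A"): 0.30,
    ("ecoli_B", "serum_A"): 0.05,
}
group_of = {"ecoli_A": "E. coli", "ecoli_B": "E. coli", "serum_A": "serum"}

flagged = cross_group_neighbours(similarity, group_of)
for sample, other, value in flagged:
    print(f"{sample} is closest to {other} ({value:.2f})")
```

Samples flagged this way are candidates for a closer look, for example for contamination, rather than evidence of a labelling bug.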