talnor commented 8 months ago

Description

Summary of the changes made:

Changed assembly command to exclude the --careful flag by default.
Change assembly command to include the --isolate
Updated spades assembler to v3.14.0

Fix for #164.

Primary function of PR

[ ] Hotfix
[ ] Patch
[ ] Minor functionality improvement
[ ] New type of analysis
[x] Backward-breaking functionality improvement
[x] This change requires internal documents to be updated
[ ] This change requires another repository to be updated

Testing

If the update is a hotfix, it is sufficient to rely on the development testing along with the Travis self-test automatically applied to the PR.

Test routine to verify the stability of the PR:

us
install branch in stage
cg workflow microsalt start vocalcoral

Test results

These are the results of the tests, and necessary conclusions, that prove the stability of the PR.

Sign-offs

[x] microSALT version is updated in microSALT/init.py
[ ] Code tested by @octocat
[ ] Approved to run at Clinical-Genomics by @sylvinite

talnor commented 7 months ago

Failed samples pass with updating settings. Typing of other samples remain the same.

v.3.3.6

v.3.4.0

karlnyr commented 3 weeks ago

Extracting loci files by running:

for run_name in $(cat test_run_names.txt);
    do
        name_pref=$(echo $run_name | cut -d _ -f 2,3);
        project_id=$(echo $run_name | cut -d _ -f 1);
        for sample_dir in /home/proj/stage/microbial/results/"${run_name}"/"${project_id}A"*;
        do
            sample_id=$(basename $sample_dir);
            for loci_file in "${sample_dir}"/blast_search/mlst/*;
            do
                file_name=$(basename $loci_file);
                cp "${loci_file}" "/home/proj/stage/microbial/validation/4.0.0/loci_files/${name_pref}_${sample_id}_${file_name}";
            done;
        done;
    done

This script will copy all loci files from the blast_search/mlst directory of each sample to the validation directory. The loci files will be renamed to include the run name, sample ID, and the original file name.

Created a Python module which extracts loci data into a database, can extract blast hits, rank said blast hit similar to microSALT, and report if two blast files have different top hits, available in:

/home/proj/production/dropbox/KN/validations/micosalt_4.0.0_validation/loci_comparison/loci_comparison/main.py

Created a script to run the loci comparison between two runs for each sample for each loci:

#!/bin/bash

# Directory where loci files are stored
loci_dir="/home/proj/stage/microbial/validation/4.0.0/loci_files"

# Python binary and script path
python_bin="/home/karl.nyren/.conda/envs/D_KN/bin/python"
script_path="/home/proj/production/dropbox/KN/validations/micosalt_4.0.0_validation/loci_comparison/loci_comparison/main.py"

# Get a list of unique sample names
samples=$(ls "$loci_dir" | awk -F'_' '{print $3}' | sort | uniq)

# Iterate over each sample
for sample in $samples; do

    # Get a list of unique loci types for the current sample
    loci_types=$(ls "$loci_dir"/*_"$sample"_loci_query_*.txt | awk -F'_' '{print $7}' | cut -d '.' -f 1 | sort | uniq)

    # Iterate over each loci type for the current sample
    for loci in $loci_types; do

        # Get list of loci files for the current sample and loci type
        loci_files=$(ls "$loci_dir"/*_"$sample"_loci_query_"$loci".txt)

        # Compare each loci file against all others for the same sample and loci type
        for file1 in $loci_files; do
            for file2 in $loci_files; do
                if [ "$file1" != "$file2" ]; then
                    "$python_bin" "$script_path" "$file1" "$file2"
                fi
            done
        done
    done
done

And then ran the script which generated the following output:

[09:32] [karl.nyren@hasta:/home/proj/stage/microbial/validation/4.0.0] [base]  $ grep FAIL compare_loci.log
FAIL,2024.6.20_16.24.4_ACC5551A9_loci_query_aroE.txt: hit_id='aroE_1228' distance_from_middle=0 percent_identity=100.0 alignment_length=380 alignment_span=0.8333333333333334, 2024.8.1_16.59.51_ACC5551A9_loci_query_aroE.txt: hit_id='aroE_1232' distance_from_middle=0 percent_identity=100.0 alignment_length=380 alignment_span=0.8333333333333334
FAIL,2024.6.20_16.24.4_ACC5551A9_loci_query_tpi.txt: hit_id='tpi_993' distance_from_middle=116 percent_identity=100.0 alignment_length=334 alignment_span=0.8308457711442786, 2024.8.1_16.59.51_ACC5551A9_loci_query_tpi.txt: hit_id='tpi_1004' distance_from_middle=116 percent_identity=100.0 alignment_length=334 alignment_span=0.8308457711442786
FAIL,2024.5.28_14.15.27_MIC3557A61_loci_query_tpi.txt: hit_id='tpi_964' distance_from_middle=19 percent_identity=100.0 alignment_length=241 alignment_span=0.599502487562189, 2024.6.20_15.3.38_MIC3557A61_loci_query_tpi.txt: hit_id='tpi_994' distance_from_middle=19 percent_identity=100.0 alignment_length=241 alignment_span=0.599502487562189
FAIL,2024.5.28_14.15.27_MIC3557A62_loci_query_tpi.txt: hit_id='tpi_964' distance_from_middle=19 percent_identity=100.0 alignment_length=241 alignment_span=0.599502487562189, 2024.6.20_15.3.38_MIC3557A62_loci_query_tpi.txt: hit_id='tpi_994' distance_from_middle=19 percent_identity=100.0 alignment_length=241 alignment_span=0.599502487562189

Some files were missing, meaning that we were not able to extract any data from said loci. The missing data is consistent between the two modes, so there is no concern there. For MIC3557A61, MIC3557A62, and ACC5551A9, the tpi loci are not the same between the two modes. However, the hits are not good either way, with a low alignment span, we would discard, as we require our hits to have at least >90% of the loci aligned.

Clinical-Genomics / microSALT

Use of --isolate flag, and update of SPAdes assembler #165