Retaining sample origin information

pakiessling commented 6 months ago

Hi, I am trying out simpleaf for the first time and I was very happy with how fast the alignment went. I processed a directory with 15 samples like this:

reads1_pat="_1.fastq.gz"
reads2_pat="_2.fastq.gz"

# Obtain and sort filenames
reads1="$(find -L ${FASTQ_DIR} -name "*$reads1_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"
reads2="$(find -L ${FASTQ_DIR} -name "*$reads2_pat*" -type f | sort | awk -v OFS=, '{$1=$1;print}' | paste -sd, -)"

# simpleaf quantfication
simpleaf quant \
--reads1 $reads1 \
--reads2 $reads2 \
--threads 24 \
--index $IDX_DIR/index \
--chemistry 10xv2 --resolution cr-like \
--expected-ori fw --unfiltered-pl \
--t2g-map $IDX_DIR/index/t2g_3col.tsv \
--output $OUTPUT_DIR

The problem I am having is that the output doesnt seem to contain any information from which file what matrix entry originated. What is the best way to deal with this? Loop over the read files and write to separate output folders? I assume that might lead to some performance penalty? Or compare barcodes of the matrix to input R2 files? Unsure how to tackle that efficiently.

rob-p commented 6 months ago

Hi @pakiessling,

Apologies for the slow response — I was away at RECOMB last week and so have been catching up on e-mail and GitHub issues.

To your point, the easiest way to deal with this would be, as you suggest, to loop over the different samples in e.g. a bash loop, and simply invoke simpleaf for each sample with a distinct output directory. Processing each sample separately is likely to slow things down a little bit since, e.g. the barcode correction will be done independently for each input sample rather than once for all reads. However, (a) this is the right thing to do ;) and (b) I don't suspect the performance penalty will be too bad at all, as the barcode correction step is quite quick, and the actual process of aligning the samples separately shouldn't really take any longer than aligning all the reads in one go.

If it's of interest and may be useful in the future, we might consider adding some functionality to allow one to directly provide e.g. a sample sheet with one row per sample, and to have the looping over the samples handled internally via simpleaf. In fact, something like this can already be accomplished pretty easily with our simpleaf workflow module (cc @DongzeHE to chime in on that), but might be something common enough to warrant it's own quant option.

Best, Rob

pakiessling commented 6 months ago

Ty! I ended up doing this:

reads1=($(find -L ${FASTQ_DIR} -name "*${reads1_pat}*" -type f | sort))
reads2=($(find -L ${FASTQ_DIR} -name "*${reads2_pat}*" -type f | sort))

# Ensure same number of read1 and read2 files
if [[ ${#reads1[@]} -ne ${#reads2[@]} ]]; then
    echo "Error: The number of read1 and read2 files does not match."
    exit 1
fi

# Loop over each pair of read1 and read2
for ((i=0; i<${#reads1[@]}; i++)); do
    read1=${reads1[$i]}
    read2=${reads2[$i]}
    echo "Processing pair: ${read1} ${read2}"

    base_name=$(basename "${read1}" "${reads1_pat}")
    pair_output_dir="${OUTPUT_DIR}/${base_name}"

    mkdir -p "${pair_output_dir}"

    simpleaf quant \
    --reads1 "${read1}" \
    --reads2 "${read2}" \
    --threads 24 \
    --index "${IDX_DIR}/index" \
    --chemistry 10xv2 --resolution cr-like \
    --expected-ori fw --unfiltered-pl \
    --t2g-map "${IDX_DIR}/index/t2g_3col.tsv" \
    --output "${pair_output_dir}"
done

Which seems to work well

COMBINE-lab / simpleaf

Retaining sample origin information #136