AlexsLemonade / alsf-scpca

Management and analysis tools for ALSF Single-cell Pediatric Cancer Atlas data.

Add mpileup workflow #148

Closed jashapiro closed 2 years ago

jashapiro commented 2 years ago

~Stacked on #147~

This PR continues to add steps for #127.

Here I am adding processing of a set of bulk samples that have been mapped with STAR, in order to identify variants in a subset of samples.

There are two major areas of work here:

  1. The workflow logic to set up the input data files, identified by the samples that comprise a multiplexed library.
  2. The mpileup process itself.

Workflow logic

For the first part, there is a fair amount of jiggering to get the channels into the correct configuration. For this I use some map{} and transpose() transformations to pull out various parts and rearrange them. transpose() may not be familiar (it was not to me!), but I think of it as a rough equivalent of pivot_longer in the tidyverse, or perhaps closer to tidyr::separate_rows: when a channel element contains a tuple, it separates that tuple into separate entries in the channel. I use that to get every sample & multiplexed library meta object as pairs for later, as in the sketch below.
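
To illustrate, a minimal example of the transpose() behavior (the channel contents here are invented for demonstration, not actual workflow data):

```groovy
// Invented example: a channel of [multiplex_meta, [sample_ids]] tuples
Channel.of(
        [[library_id: 'SCPCL000001'], ['SCPCS000001', 'SCPCS000002']]
    )
    .transpose() // emits one [multiplex_meta, sample_id] pair per sample
    .view()
// [[library_id:SCPCL000001], SCPCS000001]
// [[library_id:SCPCL000001], SCPCS000002]
```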

Once that is done, I join on the bulk results by sample_id with combine(). Here I am using some "dummy" data, in that it is a channel built from data entered directly in the workflow script, referring to previous results that I have already calculated. In the future, this will be the output of the map_bulk_star workflow, so it should mirror that output, with [meta, bamfile, bamindex] per element of the channel.
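
Roughly, the join looks like this (channel and field names are hypothetical stand-ins, assuming the [meta, bamfile, bamindex] shape described above):

```groovy
// Hypothetical sketch: key both channels by sample_id, then join with combine()
sample_pairs = multiplex_ch
    .transpose()
    .map{ multiplex_meta, sample_id -> [sample_id, multiplex_meta] }

bulk_keyed = bulk_ch // elements of [meta, bamfile, bamindex]
    .map{ meta, bamfile, bamindex -> [meta.sample_id, meta, bamfile, bamindex] }

// match elements on the first tuple position (sample_id)
joined = sample_pairs.combine(bulk_keyed, by: 0)
// each element: [sample_id, multiplex_meta, bulk_meta, bamfile, bamindex]
```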

After joining, I group by the multiplex meta objects (which means I get one group per run, as those meta objects were identical), then restructure again to build the input for the mpileup process: a new meta object that contains tuples of the sample_ids and other bulk info, but only the single multiplex library id and run ids. I didn't think keeping all of the metadata from each file was necessary at this point, but I could extend the number of fields we keep.
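
Something along these lines, where groupTuple() stands in for the grouping step and the meta field names are assumptions based on the description above:

```groovy
// Hypothetical sketch: group by the shared multiplex meta, then slim down the meta
mpileup_input = joined
    .map{ sample_id, multiplex_meta, bulk_meta, bamfile, bamindex ->
        [multiplex_meta, sample_id, bamfile, bamindex] }
    .groupTuple(by: 0) // one element per multiplexed library, since the metas are identical
    .map{ multiplex_meta, sample_ids, bamfiles, bamindexes ->
        def mpileup_meta = [
            library_id: multiplex_meta.library_id, // assumed field names
            run_id:     multiplex_meta.run_id,
            sample_ids: sample_ids // tuple of sample_ids, ordered with the bams
        ]
        [mpileup_meta, bamfiles, bamindexes]
    }
```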

I tried to comment this section pretty well, but if there are sections of the logic that are not clear, please let me know.

mpileup/bcftools parameters

For mpileup, I used what I would consider a pretty standard set of options to get to a compressed vcf file with samples labeled by sample_id (rather than the filenames that would be the default). I used mostly the same procedure as https://github.com/lmweber/snp-dmx-cancer/blob/master/genotype/genotype_bulk_bcftools/genotype_bulk_HGSOC_bcftools.sh, but added a step to remove sites with missing genotype calls. My thinking here was that if a site is not called in a bulk sample, the chance that it is called in a single-cell sample is slim, and the ambiguity of sites where we don't know at least one sample's genotype will make them less useful for demultiplexing down the line. In some quick tests on a subset of the data, it looked like this could reduce the file size a fair bit. If later testing reveals that we are not getting good demultiplexing calls, I may revisit this decision to see if it makes a difference.
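
For reference, the overall shape is something like the following sketch. This is not the exact process code: the F_MISSING filter expression and the reheader step are my guesses at how the missing-genotype filter and sample_id labeling might be written, and the resource values echo the next paragraph.

```groovy
process mpileup {
    cpus 2          // essentially single-threaded; extra cpu gives headroom for the pipes
    memory '4 GB'   // assumed value; actual requirements are quite slim

    input:
    tuple val(meta), path(bamfiles), path(bamindexes)
    path ref_fasta

    output:
    tuple val(meta), path("${meta.library_id}.vcf.gz")

    script:
    """
    # label samples by sample_id rather than the default bam filenames
    echo "${meta.sample_ids.join('\n')}" > samples.txt

    bcftools mpileup -f ${ref_fasta} -Ou ${bamfiles} \\
      | bcftools call -mv -Ou \\
      | bcftools view -e 'F_MISSING > 0' -Oz -o unlabeled.vcf.gz
    bcftools reheader -s samples.txt -o ${meta.library_id}.vcf.gz unlabeled.vcf.gz
    bcftools index ${meta.library_id}.vcf.gz
    """
}
```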

The mpileup process takes about 3-4 hours to run for the whole genome, but it is essentially single-threaded. I use two cpus to give a bit of headroom for the piped steps, and the memory requirements are quite slim. This could potentially be sped up by splitting chromosomes into separate processes, but that would add complexity. It isn't so slow that it seems unacceptable to me, but that is a potential future improvement.