simplify how samples are specified

aryarm commented 3 years ago

Come up with an intuitive data structure for internal use, and then parse into it from the following options:

Specify as a samples.tsv file
Specify as a custom user-provided py script
Specify as yaml containing strings as paths with wildcards in them

You could create a parser script for this.

aryarm commented 3 years ago

There are a total of 8 different ways to configure this depending on config options like asoc, unpaired, rna_only, and interleaved! We need to simplify this.

aryarm commented 3 years ago

@AaronJeeHo and @zrcjessica

Going forward, I think the easiest way to start would be to implement the third option above (ie specify as yaml containing strings as paths with wildcards in them), and then we can think about including the other two options depending on how we feel...

Basically, I'm thinking we could add a config option called data that gets parsed into a dictionary. Each entry in data could be a string for the path to the FASTQ or BAM files. And then we could have a wildcard within that string for the sample name. We could label the keys in the data dictionary by the names that we would have in the samples file (ex: dna_bam_path), and we could use those keys to determine which of the 8 config options we're dealing with. I'm not sure how we would map between the sample ID and the VCF sample ID, but one idea could be to have a config option that specifies the path to a file with those two things as the two columns in the file (like the original samples file)?

aryarm commented 3 years ago

Here are some examples for the possible contents of a data config option in a new samples.yml file. I'm thinking any of these would be considered valid samples.yml files. In each case, the sample ID is inferred by matching the {sample} wildcard.

ex1

data:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam # this is the dna bam
    rna:
        path: path/to/the/rna/files/{sample}.other.things.fastq.gz
        unpaired: True
    vcf_id: path/to/sample/id/mappings.tsv

ex2

# Since the unpaired config option isn't provided here, we could just assume it is paired by default (ie unpaired = False by default)
# Also, aln_cmd is a required input if the RNA or ATAC data is in BAM format. We could also try to parse this out of the contents of the file, itself.
data:
    rna:
        path: path/to/the/rna/files/{sample}.other.things.bam
        aln_cmd: "bwa mem -M {input.ref} {input.fastq} > {output}"
    vcf_id: path/to/sample/id/mappings.tsv

ex3

# In this case, we don't have any other config parameters (ie aln_cmd or unpaired), so we just set the entire value of the rna dict to a single string (ie the path)
# Since unpaired defaults to False, this situation would imply interleaved is True
data:
    rna: path/to/the/rna/files/{sample}.other.things.fq.gz
    vcf_id: path/to/sample/id/mappings.tsv

ex4

# In this case, the VCF ID is the same as the sample ID except that the VCF ID doesn't have "ID-" preceding it, so we just specify a regex expression to extract that
data:
    rna: path/to/the/rna/files/{sample}.other.things.fq
    vcf_re: '(?<=ID-).*$'

ex5

data:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam
    atac: [path/to/the/atac/files/{sample}.other.things.1.fq.gz, path/to/the/atac/files/{sample}.other.things.2.fq.gz]
    peak: path/to/the/atac/files/{sample}.other.things.bed.gz
    vcf_id: path/to/sample/id/mappings.tsv

ex6

# in this situation, some of our samples have a corresponding dna bam but others don't
data1:
    wgs: path/to/the/wgs/files/{sample}.other.things.in.the.fname.bam
    atac: [path/to/the/wgs-atac/files/{sample}.other.things.1.fq.gz, path/to/the/wgs-atac/files/{sample}.other.things.2.fq.gz]
    peak: path/to/the/wgs-peak/files/{sample}.other.things.narrowPeak
    vcf_id: path/to/sample/id/mappings1.tsv
data2:
    atac: [path/to/the/atac/files/{sample}.other.things.1.fq.gz, path/to/the/atac/files/{sample}.other.things.2.fq.gz]
    peak: path/to/the/peak/files/{sample}.other.things.narrowPeak.gz
    vcf_id: path/to/sample/id/mappings2.tsv

aryarm commented 3 years ago

We can use glob_wildcards() to infer the sample IDs from the paths. For example, if you have a config file loaded into a dictionary samples_yml with the contents from ex1 above,

samples = glob_wildcards(samples_yml['data']['wgs']).sample

aryarm / as_analysis

simplify how samples are specified #72

ex1

ex2

ex3

ex4

ex5

ex6