TheJacksonLaboratory / splicing-pipelines-nf

Repository for the Anczukow-Lab splicing pipeline

Simplify the Gen3-DRS download option #287

Vlad-Dembrovskyi closed 2 years ago

Vlad-Dembrovskyi commented 2 years ago

Currently we have to manually edit the manifest file so that it only includes the samples of interest before passing it to the pipeline. We need a way to provide just the samples of interest together with the full manifest, so that the pipeline subsets the manifest itself. Example code following shortly.

angarb commented 2 years ago

Currently this is the method: image

angarb commented 2 years ago

I would like to:

  1. Add an input parameter called manifest (with the input being the .json manifest file downloaded from GTEX)
  2. For the reads.csv, I would like to input a list of the files I want (either bams or crams) image
  3. Then, perhaps the subsetting of the manifest file can be done automatically. We would subset the manifest for the bam entries of interest in the reads.csv.

Example of manifest.json file (we will use the reads.csv to subset on "file_name" in the manifest; there could be several file types in the manifest): image

Example of manifest.csv file (we will use the reads.csv to subset the second column/"file_name" in the manifest): image
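
As a rough illustration of the subsetting step described above, here is a minimal Python sketch. It assumes the manifest.json is a JSON array of objects that each carry a "file_name" key. The "object_id" key, the sample file names, and the helper name `subset_manifest` are placeholders made up for this example, not taken from the pipeline.

```python
import json


def subset_manifest(manifest_entries, wanted_file_names):
    """Keep only the manifest entries whose "file_name" was requested."""
    wanted = set(wanted_file_names)
    return [entry for entry in manifest_entries if entry.get("file_name") in wanted]


# Hypothetical manifest content; real manifests may carry more keys per entry.
manifest_text = """
[
  {"object_id": "placeholder-1", "file_name": "sample_A.bam"},
  {"object_id": "placeholder-2", "file_name": "sample_A.bam.bai"},
  {"object_id": "placeholder-3", "file_name": "sample_B.bam"}
]
"""

manifest_entries = json.loads(manifest_text)

# File names as they might be listed in reads.csv (assumed: one name per sample).
subset = subset_manifest(manifest_entries, ["sample_B.bam"])
print(json.dumps(subset, indent=2))
```

Matching on the exact file name keeps the .bam.bai index entries out unless they are explicitly listed.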

angarb commented 2 years ago

Alternatively, we could give the specimen id GTEX-XXXX-XXXX-XX-XXXXX and it subsets the manifest for the .bam file entries.

This is possibly preferred, but it is important to note:

  1. There could be .bam and .bam.bai files in the manifest.
  2. Some files end in .Aligned.sortedByCoord.out.patched.md.bam and some in .Aligned.sortedByCoord.out.patched.bam.
  3. This means that bam would be the only option (though that seems fine as the cram files appear to only be for DNA seq in GTEX)
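
A possible sketch of the specimen-id variant, in Python. It assumes (this is not confirmed anywhere in this thread) that each manifest "file_name" starts with the specimen id followed by a dot, and it keeps only names ending in .bam, so .bam.bai entries are dropped while both the .patched.md.bam and .patched.bam suffixes are accepted. The helper name and the second specimen id are made up for the example.

```python
def bam_entries_for_specimens(manifest_entries, specimen_ids):
    """Return manifest entries that are .bam files for the given specimen ids."""
    # Assumption: file names look like "<specimen id>.<suffix>".
    prefixes = tuple(specimen_id + "." for specimen_id in specimen_ids)
    return [
        entry
        for entry in manifest_entries
        if entry.get("file_name", "").startswith(prefixes)
        and entry["file_name"].endswith(".bam")  # excludes ".bam.bai"
    ]


# Hypothetical entries using the placeholder id format from this thread.
entries = [
    {"file_name": "GTEX-XXXX-XXXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam"},
    {"file_name": "GTEX-XXXX-XXXX-XX-XXXXX.Aligned.sortedByCoord.out.patched.md.bam.bai"},
    {"file_name": "GTEX-YYYY-YYYY-YY-YYYYY.Aligned.sortedByCoord.out.patched.bam"},
]

selected = bam_entries_for_specimens(entries, ["GTEX-XXXX-XXXX-XX-XXXXX"])
for entry in selected:
    print(entry["file_name"])
```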

Vlad-Dembrovskyi commented 2 years ago

Addition: we can save the filenames that were requested but not found in the original manifest file into a not_found_GTEX_samples.txt file. We should also print them as warnings to stdout.
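
A minimal Python sketch of that reporting step, assuming the manifest has already been loaded as a list of objects with a "file_name" key. The function names and the wording of the warning are placeholders, not what the pipeline uses.

```python
def find_missing(requested_file_names, manifest_entries):
    """Return requested names absent from the manifest, in request order, without duplicates."""
    available = {entry.get("file_name") for entry in manifest_entries}
    seen = set()
    missing = []
    for name in requested_file_names:
        if name not in available and name not in seen:
            seen.add(name)
            missing.append(name)
    return missing


def report_missing(missing, out_path="not_found_GTEX_samples.txt"):
    """Write the missing names to a file (one per line) and print a warning for each to stdout."""
    with open(out_path, "w") as handle:
        for name in missing:
            handle.write(name + "\n")
    for name in missing:
        print(f"WARNING: {name} was requested but not found in the manifest")
```

Calling `report_missing(find_missing(requested, manifest_entries))` would then produce both the file and the warnings. If nothing is missing, this version writes an empty file.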