Clinical-Genomics / cg

Glue between Clinical Genomics apps

Cohort Fastq linking for cancer non-tumor (i.e. normal/germline) cases/samples #972

Closed hassanfa closed 2 years ago

hassanfa commented 3 years ago

As a busy person, I want to drink coffee so that I wake up. And sometimes I want to link multiple Fastq files in an analysis directory so that I can run BALSAMIC to generate background data.

Problem:

  1. I don't know whether this is possible. It should not start an analysis, only link the fastq files.
  2. All these samples will NOT be tumor. These will be normal germline samples.

Expected outcome / suggested solution:

For a multi-tumor or multi-non-tumor case, I'd like to only link the FastQ files and create the config. NO analysis should start.

Questions regarding issue:

Q: Does it need to run analysis?
A: No

Q: Do these cases need to be compressed?
A: Maybe not. They are not validation cases, but they should be accessible to generate pool of normals easier

Q: Will these FastQ files be internally sequenced or external?
A: Internal for now, but it would be neat to be able to add externally sequenced samples

Q: Will customer order these or will this case be created by us?
A: It will be created by us.

Q: Is this high priority?
A: Probably. One of the customers is waiting for using a pool of normals in analysis, and we just received the list of samples that we can use to generate a mega case.

What needs to be done on the planning meeting:

See https://github.com/Clinical-Genomics/development/blob/master/git/issue-reports.md for more!

hassanfa commented 3 years ago

ongoing discussion with @keyvanelhami to use one of our customer's normal samples as panel of normals.

I'd like to also be able to create a case with NO tumor sample in it, but link it all in balsamic directory.

moonso commented 3 years ago

Could you describe in a little more detail what you imagine the input and output would look like?

Could it be something like:

cg workflow balsamic normal-pool --sample sample-id1 --sample sample-id2 ... --output path/to/analysis/dir/

Or what do you have in mind?

hassanfa commented 3 years ago

These pools of normal samples contain more than 20 samples, so I imagine typing them all out would be a problem. If we create a case with all the samples we want, would the following command be too ambitious?

cg workflow balsamic normal-pool <case_name>

These samples won't be stored in housekeeper, cause the analysis result will be added either to small reference repo or target_capture_bed repo.

moonso commented 3 years ago

Ok, will there be many different pools or just a few that are being reused? If there are many, I suggest that we implement it so that the CLI call can take a file with sample ids as an argument. That would be a flexible way of creating new pools and changing existing ones. Something like:

$ cat pool1.csv
sample1
sample2
...
$ cg workflow balsamic normal-pool --samples pool1.csv --output /path/to/dir/
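A minimal sketch of how a sample-list file like pool1.csv could be expanded into the repeated --sample form suggested earlier in the thread. Note that `normal-pool` is only a proposed subcommand here, not an existing cg command, so the sketch prints the command line instead of running it. The function names `read_sample_ids` and `build_normal_pool_command` are hypothetical, and the sketch assumes sample IDs contain no whitespace.

```shell
#!/bin/sh
# Sketch only: `normal-pool` is the subcommand proposed in this thread, not an
# existing cg command, so the command line is printed rather than executed.

# Print the unique, non-empty sample IDs from a one-ID-per-line file,
# keeping the order in which they first appear.
read_sample_ids() {
    awk '
        { gsub(/\r/, ""); gsub(/^[ \t]+|[ \t]+$/, "") }  # drop CRs and padding
        $0 != "" && !seen[$0]++                          # skip blanks and repeats
    ' "$1"
}

# Expand a sample list into one --sample argument per ID.
# Assumes sample IDs contain no whitespace (hypothetical helper).
build_normal_pool_command() {
    samples_file=$1
    output_dir=$2
    cmd="cg workflow balsamic normal-pool"
    for sample in $(read_sample_ids "$samples_file"); do
        cmd="$cmd --sample $sample"
    done
    printf '%s --output %s\n' "$cmd" "$output_dir"
}
```

For a file that lists sample1 and sample2, `build_normal_pool_command pool1.csv /path/to/dir/` would print `cg workflow balsamic normal-pool --sample sample1 --sample sample2 --output /path/to/dir/`.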

> These samples won't be stored in housekeeper, cause the analysis result will be added either to small reference repo or target_capture_bed repo.

Do you mean that the samples don't exist in housekeeper at all? I assume that the system will know about the samples; otherwise it will be tricky.

Mropat commented 3 years ago

Hi! Is this feature something that will be routinely used in production, or a research venture? If it's not yet a routine analysis, perhaps we could build a specialized script or package to link files and create the config. Currently our setup for running balsamic in production is tailored to a fixed definition of what a case and its config file should look like.

hassanfa commented 3 years ago

moonso commented 3 years ago

Ok, then I think I understand. We can consider whether this should be a separate service or something that is included in the CG codebase. That choice should not affect the end result much.

hassanfa commented 3 years ago

Production will use it to generate pool of normals results. I think the solution should definitely take that into account. Wherever it lives, it should be easily accessible to them.

hassanfa commented 3 years ago

bump

Mropat commented 3 years ago

Hi! You can do the linking part already!

  1. Make a family for this with data_analysis = Balsamic and action = "hold" (important so that we don't try to autostart it like normal cases)
  2. Run cg workflow balsamic link
  3. Create a samplesheet/config of your liking manually for now (save the spec if you want it implemented later)
  4. Do your analysis

We can automate this later, but we will need a more clearly defined workflow and a project for this!

hassanfa commented 3 years ago

Update from cancer team:

ashwini06 commented 2 years ago

@Mropat: I see that a lot of discussion has happened here in the past. May I ask whether it is now possible to link multiple fastq files from different cases to one case ID?

Right now, I have around 50 samples (/home/proj/long-term-stage/cancer/PON_analysis_runs_APJ/GMCKsolid_PONsamplelist.txt) whose fastq files need to be grouped under one case_id. Is it possible to do this with cg now?

ashwini06 commented 2 years ago

Solution from @karlnyr

cg commands to link multiple fastq files to single case-id

[hiseq.clinical@hasta:~] [P_main] $ cg add family --priority standard -p OMIM-AUTO -a balsamic -dd scout cust000 panel_of_normal_20211222
givingcobra: new case added
[hiseq.clinical@hasta:~] [P_main] $ for sample in `awk '$1 !~ /sample_id/ {print $1}' /home/proj/long-term-stage/cancer/PON_analysis_runs_APJ/GMCKsolid_normalblood_custID_VW.txt | uniq`; do cg add relationship -s unknown givingcobra $sample; done
[hiseq.clinical@hasta:~] [P_main] $ cg workflow balsamic link givingcobra
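The session above could be wrapped in a small dry-run script along these lines. This is a sketch, not part of cg: the function names `list_sample_ids` and `print_link_commands` are hypothetical, and only the `cg add relationship -s unknown` and `cg workflow balsamic link` invocations are taken from the commands shown. One deliberate difference: `uniq` only collapses adjacent duplicates, so the sketch de-duplicates across the whole file in awk, on the assumption that each sample should be added once.

```shell
#!/bin/sh
# Sketch only: prints the cg commands from the session above instead of
# running them, so the list can be reviewed before anything is changed.

# Print unique sample IDs from the first column of a whitespace-delimited
# list, skipping empty lines and any row whose first column contains
# "sample_id" (the header).
list_sample_ids() {
    awk 'NF > 0 && $1 !~ /sample_id/ && !seen[$1]++ { print $1 }' "$1"
}

# Print one `cg add relationship` command per sample, then the link command.
print_link_commands() {
    case_id=$1
    sample_list=$2
    list_sample_ids "$sample_list" | while read -r sample; do
        printf 'cg add relationship -s unknown %s %s\n' "$case_id" "$sample"
    done
    printf 'cg workflow balsamic link %s\n' "$case_id"
}
```

Calling `print_link_commands givingcobra samples.txt` writes the commands to standard output; once they look right, the output can be piped to `sh` to run them.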

Mropat commented 2 years ago

Additional features will not be implemented to address this.