NCI-CGR / IlluminaSequencingAnalysis

All Illumina sequencing-related projects from Xin will be recorded in this repo

Ad-hoc Requirement: from Cenk #51

Open lxwgcool opened 2 years ago

lxwgcool commented 2 years ago

Hi all,

Once you dig out from under the holiday emails, need to start working on this request. Long story short, Stephen would like us to deliver some of the COVID WGS data to collaborators at NIAID. So need help from all of you to coordinate this:

  1. Amy, Vibha, Lisa, can you pull together a manifest of the first sets of samples that we sent to USU (the 96 low input + 384 standard)? I know Clifton’s group is starting to deliver the set of 768 (first set, not second), but I believe that has just started so I don’t believe we need to consider those.
  2. Once that manifest is prepared, Xin, can you slice out regions based on the attached .bed file from the final merged .bam files for these samples? You’ll also need to add a file name to the manifest so that they can map files:subjects.
  3. Lisa, Meredith, you can see the requested phenotype info they would like to have below. Let me know what you can/cannot provide.
  4. Nathan, no rush, but need to set up a Globus transfer for this once Xin has files prepared.

Let me know if questions or if a quick call would be useful. I think if we could have this data to them by mid Jan (week of the 10th) that would be ideal.

Thanks, Belynda

lxwgcool commented 2 years ago

New functions

  1. MatchSampleWithS3Archive
    • Check whether there are any duplicate samples
    • Count the samples that have no BAM path
  2. RetrieveDataFromS3
    • Skip duplicate samples and empty samples
    • Create a SLURM job for each sample
  3. RetrieveFileFromS3.sh
    • Bash script that retrieves a single CGR sample
    • Includes both BAM and BAI
    • Records the elapsed time
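
A minimal sketch of the duplicate and empty-sample checks described above. The manifest layout is an assumption: each row is a dict with hypothetical keys `sample_id` and `bam_path`, and for duplicates the sketch keeps the first occurrence and skips the rest. The actual MatchSampleWithS3Archive and RetrieveDataFromS3 code may handle these cases differently.

```python
from collections import Counter


def check_manifest(rows):
    """Return (duplicate sample IDs, sample IDs without a BAM path).

    `rows` is a list of dicts with the hypothetical keys
    'sample_id' and 'bam_path' (assumed layout, not from the repo).
    """
    counts = Counter(row["sample_id"] for row in rows)
    duplicates = sorted(sid for sid, n in counts.items() if n > 1)
    empty = [
        row["sample_id"]
        for row in rows
        if not (row.get("bam_path") or "").strip()
    ]
    return duplicates, empty


def select_for_retrieval(rows):
    """Keep the first occurrence of each sample that has a BAM path."""
    seen = set()
    selected = []
    for row in rows:
        sid = row["sample_id"]
        has_path = bool((row.get("bam_path") or "").strip())
        if sid in seen or not has_path:
            continue  # skip repeated samples and empty samples
        seen.add(sid)
        selected.append(row)
    return selected
```

Each row returned by `select_for_retrieval` would then correspond to one retrieval job.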
lxwgcool commented 2 years ago

Update Lisa's Excel file

  1. All duplicated samples have been removed
  2. There are still 5 empty samples that cannot be found in the USU dataset.
lxwgcool commented 2 years ago

New Features: Slice Target Reads from BAM, Update CSV

Details:

1: New function: Slice target reads from BAM using a BED file

(1) Three related source files:
a) SourceCode/SGFBam.wrapper.sh
b) SourceCode/SliceRegionFromBAM.py
c) SourceCode/SliceTargetReadsByBed.sh
(2) Uses samtools view/index
(3) Generates both the BAM and its index file
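
For reference, the samtools view/index step could be sketched roughly as follows. This is an illustration, not the code in SliceRegionFromBAM.py: the function names and file arguments are hypothetical, and it assumes `samtools` is on the PATH. `samtools view -b -L <bed>` writes the alignments that overlap the BED regions as BAM, and `samtools index` builds the index for the sliced file.

```python
import subprocess
from pathlib import Path


def build_slice_commands(in_bam, bed_file, out_bam):
    """Build the two samtools commands: slice by BED, then index."""
    view_cmd = [
        "samtools", "view",
        "-b",                  # write BAM output
        "-L", str(bed_file),   # keep alignments overlapping the BED regions
        "-o", str(out_bam),
        str(in_bam),
    ]
    index_cmd = ["samtools", "index", str(out_bam)]
    return [view_cmd, index_cmd]


def slice_bam(in_bam, bed_file, out_bam):
    """Run both commands and return the expected index path."""
    for cmd in build_slice_commands(in_bam, bed_file, out_bam):
        subprocess.run(cmd, check=True)  # raise if samtools fails
    # samtools index writes <out_bam>.bai by default
    return Path(str(out_bam) + ".bai")
```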

2: New function: Update and export new Excel file

(1) Once the sliced BAM file has been generated, the Excel file is updated by appending the BAM filename for each CGR sample
(2) Export the updated CSV file to Excel format.
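
A rough sketch of the manifest update, using only the standard library. The column names `CGR_ID` and `Sliced_BAM` are hypothetical, and the sketch stops at CSV because the export to Excel format needs a third-party library that is not shown here.

```python
import csv


def append_bam_filenames(rows, sliced_bams,
                         id_key="CGR_ID", new_col="Sliced_BAM"):
    """Return copies of `rows` with the sliced BAM filename appended.

    `sliced_bams` maps sample ID -> BAM filename; samples without a
    sliced BAM get an empty string. Column names are assumptions.
    """
    updated = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        new_row[new_col] = sliced_bams.get(row[id_key], "")
        updated.append(new_row)
    return updated


def write_csv(rows, handle):
    """Write the updated manifest rows to an open text handle."""
    if not rows:
        return
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```

Samples left with an empty filename column are the ones that would later be reported as empty samples.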

lxwgcool commented 2 years ago

New modifications

  1. Change the retrieval output folder from USU_First_Batch_low_std_96_361_18 to USU_First_Batch_low_std_96_361_18/debug
  2. New function: automatically and dynamically keep at most 20 jobs in the running pool
  3. Fix the printing error for the CGR ID
  4. Print the date in the log file to make it more readable.
  5. Evaluate whether "--checksum" needs to be added to obj_get
  6. Use flag files to control the number of concurrent jobs
  7. Deploy the crontab job on the Biowulf cluster
    */30 * * * * . /etc/bashrc && cd /home/lix33/lxwg/Git/IlluminaSequencingAnalysis/Ad-hoc/Cenk/SourceCode && python3 /home/lix33/lxwg/Git/IlluminaSequencingAnalysis/Ad-hoc/Cenk/SourceCode/GetFileFromS3.py >> /data/COVID_ADHOC/Sequencing/COVID_WGS/Data/USU_First_Batch_low_std_96_361_18/debug/Log/sum.log 2>&1
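
One way the flag-file job cap could work is sketched below. It is an assumption about the mechanism, not the logic in GetFileFromS3.py: the `.running` and `.done` flag suffixes and the function names are hypothetical. On each cron run the script would count the running flags and submit only as many pending samples as there are free slots.

```python
from pathlib import Path

MAX_RUNNING = 20  # cap from the list above


def count_running(flag_dir):
    """Count samples that currently have a '.running' flag file."""
    return sum(1 for _ in Path(flag_dir).glob("*.running"))


def pick_jobs_to_submit(pending, flag_dir, max_running=MAX_RUNNING):
    """Return the pending samples that fit into the free slots."""
    free_slots = max(0, max_running - count_running(flag_dir))
    return pending[:free_slots]


def mark_running(sample_id, flag_dir):
    """Create the running flag when a job is submitted."""
    flag = Path(flag_dir) / f"{sample_id}.running"
    flag.touch()
    return flag


def mark_done(sample_id, flag_dir):
    """Swap the running flag for a done flag when a job finishes."""
    flag_dir = Path(flag_dir)
    (flag_dir / f"{sample_id}.running").unlink(missing_ok=True)
    (flag_dir / f"{sample_id}.done").touch()
```

Because the state lives in files, each cron invocation can top up the pool without a long-running scheduler process.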
lxwgcool commented 2 years ago

Fix Bug: the obj_* commands could not be called, so the crontab job failed

Details

  1. Fix the bug: replace the obj_* commands with their full paths
    a) obj_ls -> /usr/local/bin/obj_ls
    b) obj_get -> /usr/local/bin/obj_get
  2. Print the number of running jobs and finished jobs
  3. Change the way the crontab job is deployed
    a) Load bashrc from my personal account
    b) cd to the source code folder first
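
Cron typically runs jobs with a much smaller PATH than an interactive shell, which is why bare command names can fail there. A small sketch of the full-path fix follows; the helper names are hypothetical, only the two paths come from the list above, and command arguments are left out because they are not shown in this thread.

```python
import shutil

# Absolute paths taken from the fix described above.
KNOWN_TOOLS = {
    "obj_ls": "/usr/local/bin/obj_ls",
    "obj_get": "/usr/local/bin/obj_get",
}


def resolve_tool(name):
    """Return an absolute path for `name` so it also works under cron."""
    if name in KNOWN_TOOLS:
        return KNOWN_TOOLS[name]
    found = shutil.which(name)  # fall back to a PATH lookup
    if found is None:
        raise FileNotFoundError(f"cannot locate command: {name}")
    return found


def build_command(name, *args):
    """Build an argument list that starts with the resolved path."""
    return [resolve_tool(name), *args]
```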
lxwgcool commented 2 years ago

New Features: output some statistical results

Details:

  1. SliceRegionFromBAM.py
    (1) Output the number of vSample
    (2) Output the number of running samples

  2. UpdateExcel.py
    (1) Change the output file from csv to xlsx
    (2) Output the total number of samples in the Excel file
    (3) Output the total number of empty samples
    (4) Output the empty sample list

lxwgcool commented 2 years ago

New Requirement:

Generate a file to record all delivered samples.

Code

./IlluminaSequencingAnalysis/Ad-hoc/Cenk/SourceCode/GenerateDeliveredSampleList.py

Generated File Name

Delivered_sample_list.txt

Location of the generated txt file

/data/COVID_ADHOC/Sequencing/COVID_WGS/ad-hoc/Cenk