PlantandFoodResearch / VariantAnalysis

A bioinformatic variant calling pipeline
GNU General Public License v3.0

Align Template Issues: Demo Data and Failure to Respect Config #4

Open cfljam opened 7 years ago

cfljam commented 7 years ago

When I run the alignment template, some zombie data shows up in the output. I presume it comes from the default/demo data.

My config file looks like this:

sample file rep read experiment date comments
Pool1 C95VLANXX-2143-01-11-1_L007_R1.fastq.gz 1 R1 Req_10471_HighHealthAlleles 1/01/16 HighVitC/High Fruit Weight
Pool1 C95VLANXX-2143-01-11-1_L007_R2.fastq.gz 1 R2 Req_10471_HighHealthAlleles 1/01/16 HighVitC/High Fruit Weight
Pool1 C95VLANXX-2143-01-11-1_L008_R1.fastq.gz 2 R1 Req_10471_HighHealthAlleles 1/01/16 HighVitC/High Fruit Weight
Pool1 C95VLANXX-2143-01-11-1_L008_R2.fastq.gz 2 R2 Req_10471_HighHealthAlleles 1/01/16 HighVitC/High Fruit Weight
Pool2 C95VLANXX-2143-02-11-1_L007_R1.fastq.gz 1 R1 Req_10471_HighHealthAlleles 1/01/16 HighVitC/Low Fruit Weight
Pool2 C95VLANXX-2143-02-11-1_L007_R2.fastq.gz 1 R2 Req_10471_HighHealthAlleles 1/01/16 HighVitC/Low Fruit Weight
Pool2 C95VLANXX-2143-02-11-1_L008_R1.fastq.gz 2 R1 Req_10471_HighHealthAlleles 1/01/16 HighVitC/Low Fruit Weight
Pool2 C95VLANXX-2143-02-11-1_L008_R2.fastq.gz 2 R2 Req_10471_HighHealthAlleles 1/01/16 HighVitC/Low Fruit Weight
Pool3 C95VLANXX-2143-03-11-1_L007_R1.fastq.gz 1 R1 Req_10471_HighHealthAlleles 1/01/16 LowVitC/High Fruit Weight
Pool3 C95VLANXX-2143-03-11-1_L007_R2.fastq.gz 1 R2 Req_10471_HighHealthAlleles 1/01/16 LowVitC/High Fruit Weight
Pool3 C95VLANXX-2143-03-11-1_L008_R1.fastq.gz 2 R1 Req_10471_HighHealthAlleles 1/01/16 LowVitC/High Fruit Weight
Pool3 C95VLANXX-2143-03-11-1_L008_R2.fastq.gz 2 R1 Req_10471_HighHealthAlleles 1/01/16 LowVitC/High Fruit Weight
Pool4 C95VLANXX-2143-04-11-1_L007_R1.fastq.gz 1 R2 Req_10471_HighHealthAlleles 1/01/16 LowVitC/Low Fruit Weight
Pool4 C95VLANXX-2143-04-11-1_L007_R2.fastq.gz 1 R1 Req_10471_HighHealthAlleles 1/01/16 LowVitC/Low Fruit Weight
Pool4 C95VLANXX-2143-04-11-1_L008_R1.fastq.gz 2 R2 Req_10471_HighHealthAlleles 1/01/16 LowVitC/Low Fruit Weight
Pool4 C95VLANXX-2143-04-11-1_L008_R2.fastq.gz 2 R1 Req_10471_HighHealthAlleles 1/01/16 LowVitC/Low Fruit Weight

but the output contains spurious files that I cannot account for:

(py3r-env) [19:49][cfljam@aklppf31:align (master)] $ ls 240.add_read_group_id/ -lh
total 52G
-rw-rw-r--. 1 cfljam powerplant  6.9K Sep 25 13:48 add_read_group_id_HW1_1.bai
-rw-rw-r--. 1 cfljam powerplant  114K Sep 25 13:48 add_read_group_id_HW1_1.bam
-rw-rw-r--. 1 cfljam powerplant  6.9K Sep 25 13:48 add_read_group_id_HW2_2.bai
-rw-rw-r--. 1 cfljam powerplant  113K Sep 25 13:48 add_read_group_id_HW2_2.bam
-rw-rw-r--. 1 cfljam powerplant  956K Sep 25 18:04 add_read_group_id_Pool1_1.bai
-rw-rw-r--. 1 cfljam powerplant  5.6G Sep 25 18:04 add_read_group_id_Pool1_1.bam
-rw-rw-r--. 1 cfljam powerplant  1.6M Sep 25 19:04 add_read_group_id_Pool1_2.bai
-rw-rw-r--. 1 cfljam powerplant  8.6G Sep 25 19:04 add_read_group_id_Pool1_2.bam
-rw-rw-r--. 1 cfljam powerplant  942K Sep 25 16:40 add_read_group_id_Pool2_1.bai
-rw-rw-r--. 1 cfljam powerplant  4.7G Sep 25 16:40 add_read_group_id_Pool2_1.bam
-rw-rw-r--. 1 cfljam powerplant  1.6M Sep 25 19:07 add_read_group_id_Pool2_2.bai
-rw-rw-r--. 1 cfljam powerplant  8.3G Sep 25 19:07 add_read_group_id_Pool2_2.bam
-rw-rw-r--. 1 cfljam powerplant  974K Sep 25 16:21 add_read_group_id_Pool3_1.bai
-rw-rw-r--. 1 cfljam powerplant  4.4G Sep 25 16:21 add_read_group_id_Pool3_1.bam
-rw-rw-r--. 1 cfljam powerplant  993K Sep 25 17:13 add_read_group_id_Pool4_1.bai
-rw-rw-r--. 1 cfljam powerplant  6.5G Sep 25 17:13 add_read_group_id_Pool4_1.bam
-rw-rw-r--. 1 cfljam powerplant 1004K Sep 25 16:32 add_read_group_id_Pool5_1.bai
-rw-rw-r--. 1 cfljam powerplant  4.9G Sep 25 16:32 add_read_group_id_Pool5_1.bam

Two issues:

  1. What are the HW1 and HW2 files doing in here? They appear to come from the demo data.
  2. The config file lists 4 pool samples x 2 reps, but Pool3_2 has become a (non-existent) Pool5 rep 1.
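
A quick cross-check of the design file against the output directory can be sketched in Python. This is only a rough sketch, not part of the pipeline: the function names are hypothetical, and the `add_read_group_id_<sample>_<rep>` naming is inferred from the listing above rather than taken from the pipeline code. It also flags rows whose `read` column disagrees with the `_R1`/`_R2` token in the file name, which may or may not be related to the problem.

```python
import re
from collections import defaultdict

# First four columns of the design file above (sample, file, rep, read).
DESIGN = """\
sample file rep read
Pool1 C95VLANXX-2143-01-11-1_L007_R1.fastq.gz 1 R1
Pool1 C95VLANXX-2143-01-11-1_L007_R2.fastq.gz 1 R2
Pool1 C95VLANXX-2143-01-11-1_L008_R1.fastq.gz 2 R1
Pool1 C95VLANXX-2143-01-11-1_L008_R2.fastq.gz 2 R2
Pool2 C95VLANXX-2143-02-11-1_L007_R1.fastq.gz 1 R1
Pool2 C95VLANXX-2143-02-11-1_L007_R2.fastq.gz 1 R2
Pool2 C95VLANXX-2143-02-11-1_L008_R1.fastq.gz 2 R1
Pool2 C95VLANXX-2143-02-11-1_L008_R2.fastq.gz 2 R2
Pool3 C95VLANXX-2143-03-11-1_L007_R1.fastq.gz 1 R1
Pool3 C95VLANXX-2143-03-11-1_L007_R2.fastq.gz 1 R2
Pool3 C95VLANXX-2143-03-11-1_L008_R1.fastq.gz 2 R1
Pool3 C95VLANXX-2143-03-11-1_L008_R2.fastq.gz 2 R1
Pool4 C95VLANXX-2143-04-11-1_L007_R1.fastq.gz 1 R2
Pool4 C95VLANXX-2143-04-11-1_L007_R2.fastq.gz 1 R1
Pool4 C95VLANXX-2143-04-11-1_L008_R1.fastq.gz 2 R2
Pool4 C95VLANXX-2143-04-11-1_L008_R2.fastq.gz 2 R1
"""

# BAM files from the `ls` output above.
OBSERVED = [
    "add_read_group_id_HW1_1.bam", "add_read_group_id_HW2_2.bam",
    "add_read_group_id_Pool1_1.bam", "add_read_group_id_Pool1_2.bam",
    "add_read_group_id_Pool2_1.bam", "add_read_group_id_Pool2_2.bam",
    "add_read_group_id_Pool3_1.bam", "add_read_group_id_Pool4_1.bam",
    "add_read_group_id_Pool5_1.bam",
]


def parse_design(text):
    """Return (sample, file, rep, read) tuples, skipping the header row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] == "sample":
            continue
        rows.append(tuple(fields[:4]))
    return rows


def read_label_mismatches(rows):
    """Rows whose read column disagrees with the _R1/_R2 token in the file name."""
    bad = []
    for sample, fname, rep, read in rows:
        match = re.search(r"_(R[12])\.fastq", fname)
        if match and match.group(1) != read:
            bad.append((sample, fname, read))
    return bad


def incomplete_pairs(rows):
    """(sample, rep) groups that do not have exactly one R1 and one R2."""
    reads = defaultdict(list)
    for sample, _fname, rep, read in rows:
        reads[(sample, rep)].append(read)
    return {key: val for key, val in reads.items() if sorted(val) != ["R1", "R2"]}


def compare_outputs(rows, filenames, prefix="add_read_group_id_"):
    """Return (unexpected, missing) <sample>_<rep> stems; naming is an assumption."""
    expected = {f"{sample}_{rep}" for sample, _f, rep, _r in rows}
    found = {name[len(prefix):].rsplit(".", 1)[0]
             for name in filenames if name.startswith(prefix)}
    return sorted(found - expected), sorted(expected - found)


rows = parse_design(DESIGN)
label_mismatches = read_label_mismatches(rows)
incomplete = incomplete_pairs(rows)
unexpected, missing = compare_outputs(rows, OBSERVED)

print("read labels that disagree with file names:", len(label_mismatches))
print("sample/rep groups without one R1 and one R2:", incomplete)
print("unexpected outputs:", unexpected)
print("missing outputs:", missing)
```

On the data pasted above, the check reports the HW1/HW2/Pool5 stems as unexpected, and it also shows that the `read` column in the last five design rows does not match the file names, so that may be worth ruling out first.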

hdzierz commented 7 years ago

@cfljam

Can you point me to your notebook?

Thanks

Helge

cfljam commented 7 years ago

I have reproduced this at /workspace/cfljam/HighHealth/PoolSeq/alignEA/

with the command line:

$NXF_HOME/nextflow run \
    PlantandFoodResearch/VariantAnalysis/align.nf \
    --genus 'Actinidia' \
    --input_dir $INPUTDIR \
    --genome $EAFASTA \
    --design ./design.config \
    --output_dir $EAOUTPUTDIR

Notebook visible at /workspace/cfljam/HighHealth/PoolSeq/2016-10-23AlignPoolsEANextFlow.html

The config file is the same as https://github.com/Actinidia/HighHealth/blob/master/PoolSeq/design.config

cfljam commented 7 years ago

Here is another possible issue:

ad position: EA01_02_scaffold378265:1
  INFO  2016-10-24 19:59:19     MarkDuplicates  Tracking 425633 as yet unmatched pairs. 3160 records in RAM.
  INFO  2016-10-24 20:12:15     MarkDuplicates  Read 116811705 records. 0 pairs never matched.
  INFO  2016-10-24 20:12:27     MarkDuplicates  After buildSortedReadEndLists freeMemory: 2309320448; totalMemory: 17644388352; maxMemory: 30542397440
  INFO  2016-10-24 20:12:27     MarkDuplicates  Will retain up to 954449920 duplicate indices before spilling to disk.
  [Mon Oct 24 20:12:49 NZDT 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 1,837.42 minutes.
  Runtime.totalMemory()=24965545984
  To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at htsjdk.samtools.util.SortingLongCollection.<init>(SortingLongCollection.java:112)
        at picard.sam.markduplicates.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:570)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:195)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
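
For what it is worth, the numbers in the log can be put side by side with a little arithmetic. The sketch below assumes each duplicate index is stored as an 8-byte Java long (the stack trace points at `SortingLongCollection`, but the per-entry size is an assumption here), and the variable names are made up for illustration. It only sizes the single allocation that failed; the log does not show how much heap was actually in use at that moment.

```python
# Figures copied from the MarkDuplicates log above (bytes unless noted).
MAX_MEMORY = 30_542_397_440        # maxMemory
TOTAL_MEMORY = 17_644_388_352      # totalMemory after buildSortedReadEndLists
FREE_MEMORY = 2_309_320_448        # freeMemory after buildSortedReadEndLists
RETAINED_INDICES = 954_449_920     # "Will retain up to ... duplicate indices"

BYTES_PER_INDEX = 8                # assumption: one Java long per index

# Size of the in-memory index buffer if it is allocated in one piece.
buffer_bytes = RETAINED_INDICES * BYTES_PER_INDEX
fraction_of_max = buffer_bytes / MAX_MEMORY

# Upper bound on what the JVM could still hand out at that log line:
# currently free heap plus the heap it has not yet claimed from the OS.
headroom_bytes = FREE_MEMORY + (MAX_MEMORY - TOTAL_MEMORY)

GIB = 1024 ** 3
print(f"index buffer: {buffer_bytes / GIB:.2f} GiB "
      f"({fraction_of_max:.0%} of max heap)")
print(f"nominal headroom: {headroom_bytes / GIB:.2f} GiB")
```

Under that assumption the buffer comes to exactly a quarter of the maximum heap, requested while most of the claimed heap was already occupied, so giving the JVM a larger `-Xmx` for this step looks like a reasonable thing to try.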