WeichenZhou / PALMER

Pre-mAsking Long reads for Mobile Element inseRtion
MIT License

PALMER Does Not Run Correctly #10

Closed RoxaneDunbar closed 5 years ago

RoxaneDunbar commented 5 years ago

I'm having issues with running PALMER. Here is my work through:

7/2/19 NA24385 Downloaded individual Chr bam files using ‘wget’ from:

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/CSHL_bwamem_bam_GRCh37/

$ ls
BWA-MEM_Chr10_HG002_merged_11_12.sort.bam
BWA-MEM_Chr11_HG002_merged_11_12.sort.bam
BWA-MEM_Chr12_HG002_merged_11_12.sort.bam
BWA-MEM_Chr13_HG002_merged_11_12.sort.bam
BWA-MEM_Chr14_HG002_merged_11_12.sort.bam
BWA-MEM_Chr15_HG002_merged_11_12.sort.bam
BWA-MEM_Chr16_HG002_merged_11_12.sort.bam
BWA-MEM_Chr17_HG002_merged_11_12.sort.bam
BWA-MEM_Chr18_HG002_merged_11_12.sort.bam
BWA-MEM_Chr19_HG002_merged_11_12.sort.bam
BWA-MEM_Chr1_HG002_merged_11_12.sort.bam
BWA-MEM_Chr20_HG002_merged_11_12.sort.bam
BWA-MEM_Chr21_HG002_merged_11_12.sort.bam
BWA-MEM_Chr22_HG002_merged_11_12.sort.bam
BWA-MEM_Chr2_HG002_merged_11_12.sort.bam
BWA-MEM_Chr3_HG002_merged_11_12.sort.bam
BWA-MEM_Chr4_HG002_merged_11_12.sort.bam
BWA-MEM_Chr5_HG002_merged_11_12.sort.bam
BWA-MEM_Chr6_HG002_merged_11_12.sort.bam
BWA-MEM_Chr7_HG002_merged_11_12.sort.bam
BWA-MEM_Chr8_HG002_merged_11_12.sort.bam
BWA-MEM_Chr9_HG002_merged_11_12.sort.bam
BWA-MEM_ChrX_HG002_merged_11_12.sort.bam
BWA-MEM_ChrY_HG002_merged_11_12.sort.bam

Merging individual bam files into 1 using samtools:

$ ls /media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/*.bam > bamlist.txt

$ less bamlist.txt
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr10_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr11_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr12_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr13_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr14_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr15_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr16_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr17_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr18_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr19_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr1_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr20_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr21_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr22_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr2_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr3_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr4_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr5_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr6_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr7_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr8_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_Chr9_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_ChrX_HG002_merged_11_12.sort.bam
/media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/BWA-MEM_ChrY_HG002_merged_11_12.sort.bam

$ samtools merge -b bamlist.txt NA24385_AllChr_RD07022019.bam

Indexed bam:

$ samtools index NA24385_AllChr_RD07022019.bam

Reference files:

/media/RAID/rdunbar/Ashkenazim_Trio/hs37d5.fa

Indexed reference files in: /media/RAID/rdunbar/Ashkenazim_Trio

8/2/19

Installed PALMER Version: 1.3.3 as per: https://github.com/mills-lab/PALMER

Ran PALMER on all chromosomes:

$ ./PALMER --input /media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/NA24385_AllChr_RD07022019.bam --workdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_OUTPUT/ --ref_ver GRCh37 --output NA24385_PALMER_RD08022019 --type LINE --ref_fa /media/RAID/rdunbar/Ashkenazim_Trio/hs37d5.fa
Variant type is LINE
Working directory is /media/RAID/rdunbar/PACBIO/NA24385/PALMER_OUTPUT/
Input file is /media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/NA24385_AllChr_RD07022019.bam
Output file is /media/RAID/rdunbar/PACBIO/NA24385/PALMER_OUTPUT/NA24385_PALMER_RD08022019
Running on ALL
ref is GRCh37
THERE ARE 3 REGIONS TO COUNT.
Pre-masking step & single read calling step is initiated.
Working in the direcotry mkdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_OUTPUT/version_0_0/.
[main_samview] region "version:0-0" could not be parsed. Continue anyway.
1. Samtools Step for region version_0_0 is now completed.
BLAST engine error: Empty CBlastQueryVector
Pre-masking step & single read calling step for version_0_0 completed.
TSD_module step for version_0_0 completed.
False positive exclusion step for version_0_0 completed.
Calling step for version_0_0 completed.
Working in the direcotry mkdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_OUTPUT/version_0_0/.
mkdir: cannot create directory ‘/media/RAID/rdunbar/PACBIO/NA24385/PALMER_OUTPUT/version_0_0/’: File exists
[main_samview] region "version:0-0" could not be parsed. Continue anyway.
2. Samtools Step for region version_0_0 is now completed.
BLAST engine error: Empty CBlastQueryVector
Pre-masking step & single read calling step for version_0_0 completed.
TSD_module step for version_0_0 completed.
False positive exclusion step for version_0_0 completed.
Calling step for version_0_0 completed.
Working in the direcotry mkdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_OUTPUT/version_0_0/.
mkdir: cannot create directory ‘/media/RAID/rdunbar/PACBIO/NA24385/PALMER_OUTPUT/version_0_0/’: File exists
[main_samview] region "version:0-0" could not be parsed. Continue anyway.
3. Samtools Step for region version_0_0 is now completed.
BLAST engine error: Empty CBlastQueryVector
Pre-masking step & single read calling step for version_0_0 completed.
TSD_module step for version_0_0 completed.
False positive exclusion step for version_0_0 completed.
Calling step for version_0_0 completed.
Merging step is initiated.

During this run, PALMER hangs at "Merging step is initiated.": it uses no memory and 100% CPU, but it does create folders:

[Images: pic1, pic2]

All other folders/files are empty. I’m wondering if you know what version_0_0/version:0-0 is? Or whether it is something within my bam file that PALMER thinks is a chromosome?

I have also run PALMER with "--chr 1", "--chr chr1", "--chr chrY", and "--chr chr21". With these runs, no folders are created at all, no memory is used, CPU usage is 100%, and even after 72+ hours the run is still stuck at "Pre-masking step & single read calling step is initiated."

$ ./PALMER --input /media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/NA24385_AllChr_RD07022019.bam --workdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/ --ref_ver GRCh37 --output NA24385_PALMER_RD08022019 --type LINE --chr chr1 --ref_fa /media/RAID/rdunbar/Ashkenazim_Trio/hs37d5.fa
Variant type is LINE
Working directory is /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/
Input file is /media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/NA24385_AllChr_RD07022019.bam
Output file is /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/NA24385_PALMER_RD08022019
Running on chr1
ref is GRCh37
THERE ARE 1 REGIONS TO COUNT.
Pre-masking step & single read calling step is initiated.

Any help with this would be greatly appreciated.

Kindest regards, Roxane

WeichenZhou commented 5 years ago

Hi Roxane, Thanks for posting this issue. I really don't know why there is a version_0_0 folder, since PALMER does not have any files with that name. The output of the second run for chr1, "THERE ARE 1 REGIONS TO COUNT", is also concerning. Could you please go into the PALMER folder and post the names and sizes of the files under 'index'? I suspect something is wrong with the index files. If so, you can delete PALMER and re-clone it to check whether that fixes the problem.

Best, Arthur

RoxaneDunbar commented 5 years ago

Ok, here are some updates:

I looked at the index files and found that the GitHub clone did not pull some of these files correctly.

$ ls -lh
total 66M
-rw-rw-r-- 1 rdunbar rdunbar 133 Feb 15 08:54 Alu.regions.GRCh37
-rw-rw-r-- 1 rdunbar rdunbar 36M Feb 15 08:54 Alu.regions.hg19
-rw-rw-r-- 1 rdunbar rdunbar 133 Feb 15 08:54 LINEs.regions.GRCh37
-rw-rw-r-- 1 rdunbar rdunbar 53K Feb 15 09:03 LINEs.regions.GRCh37.1
-rw-rw-r-- 1 rdunbar rdunbar 133 Feb 15 08:54 LINEs.regions.GRCh38
-rw-rw-r-- 1 rdunbar rdunbar 30M Feb 15 08:54 LINEs.regions.hg19
-rw-rw-r-- 1 rdunbar rdunbar 131 Feb 15 08:54 SVA.regions.GRCh37
-rw-rw-r-- 1 rdunbar rdunbar 113K Feb 15 08:54 SVA.regions.hg19
-rw-rw-r-- 1 rdunbar rdunbar 130 Feb 15 08:54 region.split.index.GRCh37
-rw-rw-r-- 1 rdunbar rdunbar 130 Feb 15 08:54 region.split.index.GRCh38
-rw-rw-r-- 1 rdunbar rdunbar 73K Feb 15 08:54 region.split.index.hg19

I removed PALMER and re-cloned from GitHub, however, still had the same issue. I've downloaded the individual files using wget and replaced them in the index folder for region.split.index.GRCh37 and LINEs.regions.GRCh37, and it does run (kind of), but gives:

Running on ALL
ref is GRCh37
THERE ARE 1285 REGIONS TO COUNT.
Pre-masking step & single read calling step is initiated.
Working in the direcotry mkdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/<!DOCTYPE_0_0/.
sh: 1: cannot open !DOCTYPE_0_0/: No such file
sh: 1: cannot open !DOCTYPE:0-0: No such file
sh: 1: cannot open !DOCTYPE_0_0/region.pre.sam: No such file
1. Samtools Step for region <!DOCTYPE_0_0 is now completed.
CANNOT OPEN FILE, 'region.sam'
CANNOT OPEN FILE, 'SEQ.masked'
sh: 1: cannot open !DOCTYPE_0_0/blastn.txt: No such file
sh: 1: cannot create /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/: Is a directory
CANNOT OPEN FILE, 'cigar.2'
Pre-masking step & single read calling step for <!DOCTYPE_0_0 completed.
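
The region name `<!DOCTYPE_0_0` in this log suggests that the files fetched with wget were HTML pages rather than the raw index data, and that the region name was built from the first token of the file. That reading is an inference from the log, not something confirmed in this thread. A minimal sketch for checking what a tool reading the file that way would see; `first_token` is a hypothetical helper, and the path in the usage comment is illustrative:

```shell
#!/bin/sh
# Print the first whitespace-delimited token on the first line of a file.
# A token such as "<!DOCTYPE" points at an HTML page saved in place of the data.
first_token() {
    awk 'NR == 1 { print $1; exit }' "$1"
}

# Usage (illustrative path, run from the PALMER folder):
#   first_token index/region.split.index.GRCh37
```

If the token is `<!DOCTYPE` or `<html>`, the download most likely saved the web page for the file rather than its raw content.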

I decided to run using hg19 as below:

./PALMER --input /media/RAID/rdunbar/PACBIO/NA24385/RAW_DATA/NA24385_AllChr_SORTED_RD12022019.bam --workdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/ --ref_ver hg19 --output NA24385_PALMER_RD15022019 --type LINE --chr chr21 --ref_fa /media/RAID/rdunbar/Ashkenazim_Trio/hs37d5.fa

It ran fine, but completed in just a couple minutes with nothing in the output file.

...
Working in the direcotry mkdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/chr21_47000001_48000000/.
[main_samview] region "chr21:47000001-48000000" specifies an unknown reference name. Continue anyway.
1. Samtools Step for region chr21_47000001_48000000 is now completed.
BLAST engine error: Empty CBlastQueryVector
Pre-masking step & single read calling step for chr21_47000001_48000000 completed.
TSD_module step for chr21_47000001_48000000 completed.
False positive exclusion step for chr21_47000001_48000000 completed.
Calling step for chr21_47000001_48000000 completed.
Working in the direcotry mkdir /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/chr21_48000001_48129895/.
[main_samview] region "chr21:48000001-48129895" specifies an unknown reference name. Continue anyway.
2. Samtools Step for region chr21_48000001_48129895 is now completed.
BLAST engine error: Empty CBlastQueryVector
Pre-masking step & single read calling step for chr21_48000001_48129895 completed.
TSD_module step for chr21_48000001_48129895 completed.
False positive exclusion step for chr21_48000001_48129895 completed.
Calling step for chr21_48000001_48129895 completed.
Merging step is initiated.
Merging step completed.
Final calls finished.
Results are in /media/RAID/rdunbar/PACBIO/NA24385/PALMER_INDIVID_CHR/NA24385_PALMER_RD15022019

Each of the 3000+ regions gives this error: 'BLAST engine error: Empty CBlastQueryVector'.
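
The samtools message "specifies an unknown reference name" means the contig name in the requested region is not present in the BAM header. The hs37d5 reference names its chromosomes without a "chr" prefix (for example "21"), so an hg19-style region such as "chr21:..." would not match a BAM aligned to it. Whether that is the whole cause of the empty output here is an assumption, not confirmed in this thread. A minimal sketch for listing the contig names a BAM actually declares; `contig_names` is a hypothetical helper that reads SAM header text on standard input:

```shell
#!/bin/sh
# Print the SN (sequence name) value of every @SQ line in a SAM header
# read from standard input.
contig_names() {
    awk -F '\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }'
}

# Usage (the BAM path is a placeholder):
#   samtools view -H input.bam | contig_names
```

If the names come back as "1", "2", ... rather than "chr1", "chr2", ..., the region names passed to samtools need to use the unprefixed form.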

I wonder if some other files aren't cloning from GitHub correctly too? I'm going to try and install outside of GitHub and will keep you updated.

RoxaneDunbar commented 5 years ago

Further update:

I noticed that the files that weren't downloading correctly are stored with Git LFS, which I didn't have installed. I've installed Git LFS and then reinstalled PALMER, and everything installed correctly now.
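
Without Git LFS, a clone contains a small text pointer in place of each LFS-tracked file. A pointer is three short lines, and its first line begins with `version https://git-lfs.github.com/spec/v1`. That would be consistent with the 130 to 133 byte index files, the `version_0_0` region name, and the "3 REGIONS TO COUNT" message earlier in this thread, assuming PALMER reads one region per line of the index file (an inference, not confirmed here). A minimal sketch for spotting pointers that were never fetched; `list_lfs_stubs` is a hypothetical helper:

```shell
#!/bin/sh
# Print every regular file directly under a directory whose content starts
# like a Git LFS pointer instead of real data.
list_lfs_stubs() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # Only the first bytes are needed to recognise a pointer.
        if head -c 64 "$f" | grep -q '^version https://git-lfs'; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage, from inside the PALMER clone:
#   list_lfs_stubs index
# With Git LFS installed, pointers in an existing clone can usually be
# replaced by the real files with:
#   git lfs install
#   git lfs pull
```

An empty result means no pointer stubs remain in that folder.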

I'm running chr21 using GRCh37 and it seems to be running fine now. I will update again when it's completed this chromosome.

Thanks again for your help.

RoxaneDunbar commented 5 years ago

Update:

Everything looks to be working fine now. I've run PALMER on Chr21 and Chr22 so far, and both have reported coordinates.

Thanks for the help, Arthur.

Kindest regards, Roxane