Clipper doesn't recognize contigs from GRCh38, even with GRCh38 given as species

MattBrauer commented 4 years ago

The readme suggests that GRCh38 is one of the reference genomes that clipper supports natively. But running: clipper -b /data/sample1.bam -s GRCh38 -o /data/sample1.bed

gives the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/clipper3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/envs/clipper3/lib/python3.7/site-packages/clipper/src/call_peak.py", line 932, in call_peaks
    subset_reads = list(bam_fileobj.fetch(reference=str(interval.chrom), start=interval.start, end=interval.stop))
  File "pysam/libcalignmentfile.pyx", line 1081, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 686, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid contig `chr1_KI270708v1_random`

MattBrauer commented 4 years ago

grepping chr1_KI270708v1_random shows that it is present in both data/GRCh38.AS.STRUCTURE.COMPILED.gff and in the bed files in data/regions (e.g., data/regions/GRCh38_exons.bed).

matteofloris commented 4 years ago

I have the same problem.

byee4 commented 4 years ago

Hi, can anyone share the samtools header? samtools view -H sample.bam > header.txt

matteofloris commented 4 years ago

Sure.

Hi, can anyone share the samtools header? samtools view -H sample.bam > header.txt

@HD VN:1.4 SO:coordinate @SQ SN:chr1 LN:248956422 @SQ SN:chr2 LN:242193529 @SQ SN:chr3 LN:198295559 @SQ SN:chr4 LN:190214555 @SQ SN:chr5 LN:181538259 @SQ SN:chr6 LN:170805979 @SQ SN:chr7 LN:159345973 @SQ SN:chr8 LN:145138636 @SQ SN:chr9 LN:138394717 @SQ SN:chr10 LN:133797422 @SQ SN:chr11 LN:135086622 @SQ SN:chr12 LN:133275309 @SQ SN:chr13 LN:114364328 @SQ SN:chr14 LN:107043718 @SQ SN:chr15 LN:101991189 @SQ SN:chr16 LN:90338345 @SQ SN:chr17 LN:83257441 @SQ SN:chr18 LN:80373285 @SQ SN:chr19 LN:58617616 @SQ SN:chr20 LN:64444167 @SQ SN:chr21 LN:46709983 @SQ SN:chr22 LN:50818468 @SQ SN:chrX LN:156040895 @SQ SN:chrY LN:57227415 @SQ SN:chrM LN:16569 @SQ SN:GL000008.2 LN:209709 @SQ SN:GL000009.2 LN:201709 @SQ SN:GL000194.1 LN:191469 @SQ SN:GL000195.1 LN:182896 @SQ SN:GL000205.2 LN:185591 @SQ SN:GL000208.1 LN:92689 @SQ SN:GL000213.1 LN:164239 @SQ SN:GL000214.1 LN:137718 @SQ SN:GL000216.2 LN:176608 @SQ SN:GL000218.1 LN:161147 @SQ SN:GL000219.1 LN:179198 @SQ SN:GL000220.1 LN:161802 @SQ SN:GL000221.1 LN:155397 @SQ SN:GL000224.1 LN:179693 @SQ SN:GL000225.1 LN:211173 @SQ SN:GL000226.1 LN:15008 @SQ SN:KI270302.1 LN:2274 @SQ SN:KI270303.1 LN:1942 @SQ SN:KI270304.1 LN:2165 @SQ SN:KI270305.1 LN:1472 @SQ SN:KI270310.1 LN:1201 @SQ SN:KI270311.1 LN:12399 @SQ SN:KI270312.1 LN:998 @SQ SN:KI270315.1 LN:2276 @SQ SN:KI270316.1 LN:1444 @SQ SN:KI270317.1 LN:37690 @SQ SN:KI270320.1 LN:4416 @SQ SN:KI270322.1 LN:21476 @SQ SN:KI270329.1 LN:1040 @SQ SN:KI270330.1 LN:1652 @SQ SN:KI270333.1 LN:2699 @SQ SN:KI270334.1 LN:1368 @SQ SN:KI270335.1 LN:1048 @SQ SN:KI270336.1 LN:1026 @SQ SN:KI270337.1 LN:1121 @SQ SN:KI270338.1 LN:1428 @SQ SN:KI270340.1 LN:1428 @SQ SN:KI270362.1 LN:3530 @SQ SN:KI270363.1 LN:1803 @SQ SN:KI270364.1 LN:2855 @SQ SN:KI270366.1 LN:8320 @SQ SN:KI270371.1 LN:2805 @SQ SN:KI270372.1 LN:1650 @SQ SN:KI270373.1 LN:1451 @SQ SN:KI270374.1 LN:2656 @SQ SN:KI270375.1 LN:2378 @SQ SN:KI270376.1 LN:1136 @SQ SN:KI270378.1 LN:1048 @SQ SN:KI270379.1 LN:1045 @SQ SN:KI270381.1 LN:1930 @SQ SN:KI270382.1 LN:4215 @SQ SN:KI270383.1 LN:1750 @SQ SN:KI270384.1 LN:1658 @SQ SN:KI270385.1 LN:990 @SQ SN:KI270386.1 LN:1788 @SQ SN:KI270387.1 LN:1537 @SQ SN:KI270388.1 LN:1216 @SQ SN:KI270389.1 LN:1298 @SQ SN:KI270390.1 LN:2387 @SQ SN:KI270391.1 LN:1484 @SQ SN:KI270392.1 LN:971 @SQ SN:KI270393.1 LN:1308 @SQ SN:KI270394.1 LN:970 @SQ SN:KI270395.1 LN:1143 @SQ SN:KI270396.1 LN:1880 @SQ SN:KI270411.1 LN:2646 @SQ SN:KI270412.1 LN:1179 @SQ SN:KI270414.1 LN:2489 @SQ SN:KI270417.1 LN:2043 @SQ SN:KI270418.1 LN:2145 @SQ SN:KI270419.1 LN:1029 @SQ SN:KI270420.1 LN:2321 @SQ SN:KI270422.1 LN:1445 @SQ SN:KI270423.1 LN:981 @SQ SN:KI270424.1 LN:2140 @SQ SN:KI270425.1 LN:1884 @SQ SN:KI270429.1 LN:1361 @SQ SN:KI270435.1 LN:92983 @SQ SN:KI270438.1 LN:112505 @SQ SN:KI270442.1 LN:392061 @SQ SN:KI270448.1 LN:7992 @SQ SN:KI270465.1 LN:1774 @SQ SN:KI270466.1 LN:1233 @SQ SN:KI270467.1 LN:3920 @SQ SN:KI270468.1 LN:4055 @SQ SN:KI270507.1 LN:5353 @SQ SN:KI270508.1 LN:1951 @SQ SN:KI270509.1 LN:2318 @SQ SN:KI270510.1 LN:2415 @SQ SN:KI270511.1 LN:8127 @SQ SN:KI270512.1 LN:22689 @SQ SN:KI270515.1 LN:6361 @SQ SN:KI270516.1 LN:1300 @SQ SN:KI270517.1 LN:3253 @SQ SN:KI270518.1 LN:2186 @SQ SN:KI270519.1 LN:138126 @SQ SN:KI270521.1 LN:7642 @SQ SN:KI270522.1 LN:5674 @SQ SN:KI270528.1 LN:2983 @SQ SN:KI270529.1 LN:1899 @SQ SN:KI270530.1 LN:2168 @SQ SN:KI270538.1 LN:91309 @SQ SN:KI270539.1 LN:993 @SQ SN:KI270544.1 LN:1202 @SQ SN:KI270548.1 LN:1599 @SQ SN:KI270579.1 LN:31033 @SQ SN:KI270580.1 LN:1553 @SQ SN:KI270581.1 LN:7046 @SQ SN:KI270582.1 LN:6504 @SQ SN:KI270583.1 LN:1400 @SQ SN:KI270584.1 LN:4513 @SQ SN:KI270587.1 LN:2969 @SQ SN:KI270588.1 LN:6158 @SQ SN:KI270589.1 LN:44474 @SQ SN:KI270590.1 LN:4685 @SQ SN:KI270591.1 LN:5796 @SQ SN:KI270593.1 LN:3041 @SQ SN:KI270706.1 LN:175055 @SQ SN:KI270707.1 LN:32032 @SQ SN:KI270708.1 LN:127682 @SQ SN:KI270709.1 LN:66860 @SQ SN:KI270710.1 LN:40176 @SQ SN:KI270711.1 LN:42210 @SQ SN:KI270712.1 LN:176043 @SQ SN:KI270713.1 LN:40745 @SQ SN:KI270714.1 LN:41717 @SQ SN:KI270715.1 LN:161471 @SQ SN:KI270716.1 LN:153799 @SQ SN:KI270717.1 LN:40062 @SQ SN:KI270718.1 LN:38054 @SQ SN:KI270719.1 LN:176845 @SQ SN:KI270720.1 LN:39050 @SQ SN:KI270721.1 LN:100316 @SQ SN:KI270722.1 LN:194050 @SQ SN:KI270723.1 LN:38115 @SQ SN:KI270724.1 LN:39555 @SQ SN:KI270725.1 LN:172810 @SQ SN:KI270726.1 LN:43739 @SQ SN:KI270727.1 LN:448248 @SQ SN:KI270728.1 LN:1872759 @SQ SN:KI270729.1 LN:280839 @SQ SN:KI270730.1 LN:112551 @SQ SN:KI270731.1 LN:150754 @SQ SN:KI270732.1 LN:41543 @SQ SN:KI270733.1 LN:179772 @SQ SN:KI270734.1 LN:165050 @SQ SN:KI270735.1 LN:42811 @SQ SN:KI270736.1 LN:181920 @SQ SN:KI270737.1 LN:103838 @SQ SN:KI270738.1 LN:99375 @SQ SN:KI270739.1 LN:73985 @SQ SN:KI270740.1 LN:37240 @SQ SN:KI270741.1 LN:157432 @SQ SN:KI270742.1 LN:186739 @SQ SN:KI270743.1 LN:210658 @SQ SN:KI270744.1 LN:168472 @SQ SN:KI270745.1 LN:41891 @SQ SN:KI270746.1 LN:66486 @SQ SN:KI270747.1 LN:198735 @SQ SN:KI270748.1 LN:93321 @SQ SN:KI270749.1 LN:158759 @SQ SN:KI270750.1 LN:148850 @SQ SN:KI270751.1 LN:150742 @SQ SN:KI270752.1 LN:27745 @SQ SN:KI270753.1 LN:62944 @SQ SN:KI270754.1 LN:40191 @SQ SN:KI270755.1 LN:36723 @SQ SN:KI270756.1 LN:79590 @SQ SN:KI270757.1 LN:71251 @PG ID:STAR PN:STAR VN:2.7.6a CL:/home/matteo/Scrivania/iCLIP_NF90/BIN/STAR-2.7.6a/bin/Linux_x86_64/STAR --runMode alignReads --runThreadN 6 --genomeDir GENOME/ --readFilesIn FASTQ/Rt9clip.fastq.gz --readFilesCommand zcat --outFileNamePrefix ALIGN/Rt9 --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within KeepPairs --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 10 --outFilterMismatchNmax 2 --alignEndsType EndToEnd --sjdbGTFfile GENOME/gencode.v35.basic.annotation.gtf @CO user command line: /home/matteo/Scrivania/iCLIP_NF90/BIN/STAR-2.7.6a/bin/Linux_x86_64/STAR --runMode alignReads --runThreadN 6 --genomeDir GENOME/ --readFilesIn FASTQ/Rt9clip.fastq.gz --outFileNamePrefix ALIGN/Rt9 --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 10 --outFilterMismatchNmax 2 --alignEndsType EndToEnd --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within KeepPairs --sjdbGTFfile GENOME/gencode.v35.basic.annotation.gtf --readFilesCommand zcat

byee4 commented 4 years ago

It looks like the fasta used to generate the STAR index doesn't match what clipper expects for those chromosomes, do you know where the fasta reference was downloaded from? As a potential workaround, is it possible to map to the ENCODE reference? https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/

matteofloris commented 4 years ago

I aligned my fastq files to the reference GENCODE dataset. Btw, 1 hour ago I rerun with species = GRCh38_v29e instead of GRCh38... Clipper is now running... How long it should take?

Matteo Floris, PhD - Univ. Sassari

Il mar 13 ott 2020, 20:33 Brian Yee notifications@github.com ha scritto:

It looks like the fasta used to generate the STAR index doesn't match what clipper expects for those chromosomes, do you know where the fasta reference was downloaded from? As a potential workaround, is it possible to map to the ENCODE reference? https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/YeoLab/clipper/issues/87#issuecomment-707930695, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ5JNEDXZDVVJPRJGTEETDSKSMOBANCNFSM4R7NT6YQ .

byee4 commented 4 years ago

Clipper usually takes all day on our 16-core nodes, faster on 32-cores. So it will depend on your setup, but I wouldn't wait around for it to finish

matteofloris commented 4 years ago

After 1-night-long run, another similar error:

"ValueError: invalid contig chr14_GL000009v2_random".

Is there any way to exclude those non-canonical contigs from the analysis? There is no other workaround than mapping to the ENCODE ref?

Thank you so much for your support! Matteo

Il giorno mar 13 ott 2020 alle ore 21:08 Brian Yee notifications@github.com ha scritto:

Clipper usually takes all day on our 16-core nodes, faster on 32-cores. So it will depend on your setup, but I wouldn't wait around for it to finish

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/YeoLab/clipper/issues/87#issuecomment-707949380, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ5JNDBBHQBLEHI4YEULN3SKSQSPANCNFSM4R7NT6YQ .

--

Matteo Floris, PhD Associate Professor - Medical Genetics Department of Biomedical Sciences University of Sassari (Italy) Tel +39.333.48.57.679 email: mfloris@uniss.it, matteo.floris@gmail.com

byee4 commented 4 years ago

Hm, the only other ways I can think of is to create a new reference set and re-install Clipper (see instructions here). Or, to hack a bit, remove the offending annotations on chr14_GL000009v2_random from these files and re-install:

https://github.com/YeoLab/clipper/blob/master/clipper/data/GRCh38_v29e.AS.STRUCTURE.COMPILED.gff https://github.com/YeoLab/clipper/blob/master/clipper/data/regions/GRCh38_v29e_genes.bed https://github.com/YeoLab/clipper/blob/master/clipper/data/regions/GRCh38_v29e_exons.bed

matteofloris commented 4 years ago

It looks like the fasta used to generate the STAR index doesn't match what clipper expects for those chromosomes, do you know where the fasta reference was downloaded from? As a potential workaround, is it possible to map to the ENCODE reference? https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/

I mapped to the ENCODE ref and now it works...

BuddahKat commented 2 years ago

I came across this after running into the same issue (I used GENCODE reference during mapping but Clipper wants chromosome names to be in UCSC naming convention). My solution was just to use samtools to replace the header of the alignment file with one containing the correct names. As long as you don't reorder the header and only do substitutions then it shouldn't mess anything up since the BAM file uses integer indexes instead of the actual chromosome names. So steps would be -

Extract header from the original alignment:

samtools view -H alignment_oldNames.bam > header.sam

Convert the names (if using a file where 1st column is 'oldName' and 2nd column is 'newName' like those from dpryan79/ChromosomeMappings, could do something in bash like:
```
#!/bin/sh
while read a b
do
sed -i "s/$a/$b/g" header.sam
done < conversionTable.txt
```
Replace the header:

samtools reheader header.sam alignment_oldNames.bam > alignment_newNames.bam

YeoLab / clipper

Clipper doesn't recognize contigs from GRCh38, even with GRCh38 given as species #87