Closed MattBrauer closed 3 years ago
grepping chr1_KI270708v1_random
shows that it is present in both data/GRCh38.AS.STRUCTURE.COMPILED.gff
and in the bed
files in data/regions
(e.g., data/regions/GRCh38_exons.bed
).
I have the same problem.
Hi, can anyone share the samtools header? samtools view -H sample.bam > header.txt
Sure.
Hi, can anyone share the samtools header?
samtools view -H sample.bam > header.txt
@HD VN:1.4 SO:coordinate @SQ SN:chr1 LN:248956422 @SQ SN:chr2 LN:242193529 @SQ SN:chr3 LN:198295559 @SQ SN:chr4 LN:190214555 @SQ SN:chr5 LN:181538259 @SQ SN:chr6 LN:170805979 @SQ SN:chr7 LN:159345973 @SQ SN:chr8 LN:145138636 @SQ SN:chr9 LN:138394717 @SQ SN:chr10 LN:133797422 @SQ SN:chr11 LN:135086622 @SQ SN:chr12 LN:133275309 @SQ SN:chr13 LN:114364328 @SQ SN:chr14 LN:107043718 @SQ SN:chr15 LN:101991189 @SQ SN:chr16 LN:90338345 @SQ SN:chr17 LN:83257441 @SQ SN:chr18 LN:80373285 @SQ SN:chr19 LN:58617616 @SQ SN:chr20 LN:64444167 @SQ SN:chr21 LN:46709983 @SQ SN:chr22 LN:50818468 @SQ SN:chrX LN:156040895 @SQ SN:chrY LN:57227415 @SQ SN:chrM LN:16569 @SQ SN:GL000008.2 LN:209709 @SQ SN:GL000009.2 LN:201709 @SQ SN:GL000194.1 LN:191469 @SQ SN:GL000195.1 LN:182896 @SQ SN:GL000205.2 LN:185591 @SQ SN:GL000208.1 LN:92689 @SQ SN:GL000213.1 LN:164239 @SQ SN:GL000214.1 LN:137718 @SQ SN:GL000216.2 LN:176608 @SQ SN:GL000218.1 LN:161147 @SQ SN:GL000219.1 LN:179198 @SQ SN:GL000220.1 LN:161802 @SQ SN:GL000221.1 LN:155397 @SQ SN:GL000224.1 LN:179693 @SQ SN:GL000225.1 LN:211173 @SQ SN:GL000226.1 LN:15008 @SQ SN:KI270302.1 LN:2274 @SQ SN:KI270303.1 LN:1942 @SQ SN:KI270304.1 LN:2165 @SQ SN:KI270305.1 LN:1472 @SQ SN:KI270310.1 LN:1201 @SQ SN:KI270311.1 LN:12399 @SQ SN:KI270312.1 LN:998 @SQ SN:KI270315.1 LN:2276 @SQ SN:KI270316.1 LN:1444 @SQ SN:KI270317.1 LN:37690 @SQ SN:KI270320.1 LN:4416 @SQ SN:KI270322.1 LN:21476 @SQ SN:KI270329.1 LN:1040 @SQ SN:KI270330.1 LN:1652 @SQ SN:KI270333.1 LN:2699 @SQ SN:KI270334.1 LN:1368 @SQ SN:KI270335.1 LN:1048 @SQ SN:KI270336.1 LN:1026 @SQ SN:KI270337.1 LN:1121 @SQ SN:KI270338.1 LN:1428 @SQ SN:KI270340.1 LN:1428 @SQ SN:KI270362.1 LN:3530 @SQ SN:KI270363.1 LN:1803 @SQ SN:KI270364.1 LN:2855 @SQ SN:KI270366.1 LN:8320 @SQ SN:KI270371.1 LN:2805 @SQ SN:KI270372.1 LN:1650 @SQ SN:KI270373.1 LN:1451 @SQ SN:KI270374.1 LN:2656 @SQ SN:KI270375.1 LN:2378 @SQ SN:KI270376.1 LN:1136 @SQ SN:KI270378.1 LN:1048 @SQ SN:KI270379.1 LN:1045 @SQ SN:KI270381.1 LN:1930 @SQ SN:KI270382.1 LN:4215 @SQ SN:KI270383.1 LN:1750 @SQ SN:KI270384.1 LN:1658 @SQ SN:KI270385.1 LN:990 @SQ SN:KI270386.1 LN:1788 @SQ SN:KI270387.1 LN:1537 @SQ SN:KI270388.1 LN:1216 @SQ SN:KI270389.1 LN:1298 @SQ SN:KI270390.1 LN:2387 @SQ SN:KI270391.1 LN:1484 @SQ SN:KI270392.1 LN:971 @SQ SN:KI270393.1 LN:1308 @SQ SN:KI270394.1 LN:970 @SQ SN:KI270395.1 LN:1143 @SQ SN:KI270396.1 LN:1880 @SQ SN:KI270411.1 LN:2646 @SQ SN:KI270412.1 LN:1179 @SQ SN:KI270414.1 LN:2489 @SQ SN:KI270417.1 LN:2043 @SQ SN:KI270418.1 LN:2145 @SQ SN:KI270419.1 LN:1029 @SQ SN:KI270420.1 LN:2321 @SQ SN:KI270422.1 LN:1445 @SQ SN:KI270423.1 LN:981 @SQ SN:KI270424.1 LN:2140 @SQ SN:KI270425.1 LN:1884 @SQ SN:KI270429.1 LN:1361 @SQ SN:KI270435.1 LN:92983 @SQ SN:KI270438.1 LN:112505 @SQ SN:KI270442.1 LN:392061 @SQ SN:KI270448.1 LN:7992 @SQ SN:KI270465.1 LN:1774 @SQ SN:KI270466.1 LN:1233 @SQ SN:KI270467.1 LN:3920 @SQ SN:KI270468.1 LN:4055 @SQ SN:KI270507.1 LN:5353 @SQ SN:KI270508.1 LN:1951 @SQ SN:KI270509.1 LN:2318 @SQ SN:KI270510.1 LN:2415 @SQ SN:KI270511.1 LN:8127 @SQ SN:KI270512.1 LN:22689 @SQ SN:KI270515.1 LN:6361 @SQ SN:KI270516.1 LN:1300 @SQ SN:KI270517.1 LN:3253 @SQ SN:KI270518.1 LN:2186 @SQ SN:KI270519.1 LN:138126 @SQ SN:KI270521.1 LN:7642 @SQ SN:KI270522.1 LN:5674 @SQ SN:KI270528.1 LN:2983 @SQ SN:KI270529.1 LN:1899 @SQ SN:KI270530.1 LN:2168 @SQ SN:KI270538.1 LN:91309 @SQ SN:KI270539.1 LN:993 @SQ SN:KI270544.1 LN:1202 @SQ SN:KI270548.1 LN:1599 @SQ SN:KI270579.1 LN:31033 @SQ SN:KI270580.1 LN:1553 @SQ SN:KI270581.1 LN:7046 @SQ SN:KI270582.1 LN:6504 @SQ SN:KI270583.1 LN:1400 @SQ SN:KI270584.1 LN:4513 @SQ SN:KI270587.1 LN:2969 @SQ SN:KI270588.1 LN:6158 @SQ SN:KI270589.1 LN:44474 @SQ SN:KI270590.1 LN:4685 @SQ SN:KI270591.1 LN:5796 @SQ SN:KI270593.1 LN:3041 @SQ SN:KI270706.1 LN:175055 @SQ SN:KI270707.1 LN:32032 @SQ SN:KI270708.1 LN:127682 @SQ SN:KI270709.1 LN:66860 @SQ SN:KI270710.1 LN:40176 @SQ SN:KI270711.1 LN:42210 @SQ SN:KI270712.1 LN:176043 @SQ SN:KI270713.1 LN:40745 @SQ SN:KI270714.1 LN:41717 @SQ SN:KI270715.1 LN:161471 @SQ SN:KI270716.1 LN:153799 @SQ SN:KI270717.1 LN:40062 @SQ SN:KI270718.1 LN:38054 @SQ SN:KI270719.1 LN:176845 @SQ SN:KI270720.1 LN:39050 @SQ SN:KI270721.1 LN:100316 @SQ SN:KI270722.1 LN:194050 @SQ SN:KI270723.1 LN:38115 @SQ SN:KI270724.1 LN:39555 @SQ SN:KI270725.1 LN:172810 @SQ SN:KI270726.1 LN:43739 @SQ SN:KI270727.1 LN:448248 @SQ SN:KI270728.1 LN:1872759 @SQ SN:KI270729.1 LN:280839 @SQ SN:KI270730.1 LN:112551 @SQ SN:KI270731.1 LN:150754 @SQ SN:KI270732.1 LN:41543 @SQ SN:KI270733.1 LN:179772 @SQ SN:KI270734.1 LN:165050 @SQ SN:KI270735.1 LN:42811 @SQ SN:KI270736.1 LN:181920 @SQ SN:KI270737.1 LN:103838 @SQ SN:KI270738.1 LN:99375 @SQ SN:KI270739.1 LN:73985 @SQ SN:KI270740.1 LN:37240 @SQ SN:KI270741.1 LN:157432 @SQ SN:KI270742.1 LN:186739 @SQ SN:KI270743.1 LN:210658 @SQ SN:KI270744.1 LN:168472 @SQ SN:KI270745.1 LN:41891 @SQ SN:KI270746.1 LN:66486 @SQ SN:KI270747.1 LN:198735 @SQ SN:KI270748.1 LN:93321 @SQ SN:KI270749.1 LN:158759 @SQ SN:KI270750.1 LN:148850 @SQ SN:KI270751.1 LN:150742 @SQ SN:KI270752.1 LN:27745 @SQ SN:KI270753.1 LN:62944 @SQ SN:KI270754.1 LN:40191 @SQ SN:KI270755.1 LN:36723 @SQ SN:KI270756.1 LN:79590 @SQ SN:KI270757.1 LN:71251 @PG ID:STAR PN:STAR VN:2.7.6a CL:/home/matteo/Scrivania/iCLIP_NF90/BIN/STAR-2.7.6a/bin/Linux_x86_64/STAR --runMode alignReads --runThreadN 6 --genomeDir GENOME/ --readFilesIn FASTQ/Rt9clip.fastq.gz --readFilesCommand zcat --outFileNamePrefix ALIGN/Rt9 --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within KeepPairs --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 10 --outFilterMismatchNmax 2 --alignEndsType EndToEnd --sjdbGTFfile GENOME/gencode.v35.basic.annotation.gtf @CO user command line: /home/matteo/Scrivania/iCLIP_NF90/BIN/STAR-2.7.6a/bin/Linux_x86_64/STAR --runMode alignReads --runThreadN 6 --genomeDir GENOME/ --readFilesIn FASTQ/Rt9clip.fastq.gz --outFileNamePrefix ALIGN/Rt9 --outSAMprimaryFlag AllBestScore --outFilterMultimapNmax 10 --outFilterMismatchNmax 2 --alignEndsType EndToEnd --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within KeepPairs --sjdbGTFfile GENOME/gencode.v35.basic.annotation.gtf --readFilesCommand zcat
It looks like the fasta used to generate the STAR index doesn't match what clipper expects for those chromosomes, do you know where the fasta reference was downloaded from? As a potential workaround, is it possible to map to the ENCODE reference? https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/
I aligned my fastq files to the reference GENCODE dataset. Btw, 1 hour ago I rerun with species = GRCh38_v29e instead of GRCh38... Clipper is now running... How long it should take?
Matteo Floris, PhD - Univ. Sassari
Il mar 13 ott 2020, 20:33 Brian Yee notifications@github.com ha scritto:
It looks like the fasta used to generate the STAR index doesn't match what clipper expects for those chromosomes, do you know where the fasta reference was downloaded from? As a potential workaround, is it possible to map to the ENCODE reference? https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/YeoLab/clipper/issues/87#issuecomment-707930695, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ5JNEDXZDVVJPRJGTEETDSKSMOBANCNFSM4R7NT6YQ .
Clipper usually takes all day on our 16-core nodes, faster on 32-cores. So it will depend on your setup, but I wouldn't wait around for it to finish
After 1-night-long run, another similar error:
"ValueError: invalid contig chr14_GL000009v2_random
".
Is there any way to exclude those non-canonical contigs from the analysis? There is no other workaround than mapping to the ENCODE ref?
Thank you so much for your support! Matteo
Il giorno mar 13 ott 2020 alle ore 21:08 Brian Yee notifications@github.com ha scritto:
Clipper usually takes all day on our 16-core nodes, faster on 32-cores. So it will depend on your setup, but I wouldn't wait around for it to finish
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/YeoLab/clipper/issues/87#issuecomment-707949380, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ5JNDBBHQBLEHI4YEULN3SKSQSPANCNFSM4R7NT6YQ .
--
Matteo Floris, PhD Associate Professor - Medical Genetics Department of Biomedical Sciences University of Sassari (Italy) Tel +39.333.48.57.679 email: mfloris@uniss.it, matteo.floris@gmail.com
Hm, the only other ways I can think of is to create a new reference set and re-install Clipper (see instructions here). Or, to hack a bit, remove the offending annotations on chr14_GL000009v2_random from these files and re-install:
https://github.com/YeoLab/clipper/blob/master/clipper/data/GRCh38_v29e.AS.STRUCTURE.COMPILED.gff https://github.com/YeoLab/clipper/blob/master/clipper/data/regions/GRCh38_v29e_genes.bed https://github.com/YeoLab/clipper/blob/master/clipper/data/regions/GRCh38_v29e_exons.bed
It looks like the fasta used to generate the STAR index doesn't match what clipper expects for those chromosomes, do you know where the fasta reference was downloaded from? As a potential workaround, is it possible to map to the ENCODE reference? https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/
I mapped to the ENCODE ref and now it works...
I came across this after running into the same issue (I used GENCODE reference during mapping but Clipper wants chromosome names to be in UCSC naming convention). My solution was just to use samtools to replace the header of the alignment file with one containing the correct names. As long as you don't reorder the header and only do substitutions then it shouldn't mess anything up since the BAM file uses integer indexes instead of the actual chromosome names. So steps would be -
samtools view -H alignment_oldNames.bam > header.sam
Convert the names (if using a file where 1st column is 'oldName' and 2nd column is 'newName' like those from dpryan79/ChromosomeMappings, could do something in bash like:
#!/bin/sh
while read a b
do
sed -i "s/$a/$b/g" header.sam
done < conversionTable.txt
Replace the header:
samtools reheader header.sam alignment_oldNames.bam > alignment_newNames.bam
The readme suggests that GRCh38 is one of the reference genomes that clipper supports natively. But running:
clipper -b /data/sample1.bam -s GRCh38 -o /data/sample1.bed
gives the following error: