Trouble with chr prefix

BackofenLab / Peakhood

Individual site context extraction for CLIP-Seq peak regions

MIT License

3 stars 3 forks source link

Trouble with chr prefix #5

Open mirax87 opened 2 years ago

mirax87 commented 2 years ago

Hi everyone,

I am curious to use your tool on our in-house iCLIP data set.

After a very convenient installation (conda installation, peakhood v0.3), I executed the 'peakhood extract' and ran into a problem with the chr prefix, i.e. 'chr4 not found'. All the files I am using (BED, BAM, GTF, 2bit)come without the 'chr' prefix. I saw in the code, that for standardization a 'chr' prefix is added, but only for the 2bit genome and not for the query coordinates.

For running peakhood, I am using the output from the iCount pipeline (peak clusters in BED format, BAM files). the GTF is filtered, such that it contains only the canonical chromosomes (dm6, autosomes + X chromosome).

below is the error message.

Best, -Michael

(peakhood) $ peakhood extract --in data/peaks/iCLIP.bed  --bam data/bam_files/iCLIP.bam --bam-pp-mode 3 --read-pos-mode 2 --gtf data/resources/dm6_ensembl96.filtered.gtf --gen data/resources/dm6.2bit --report --out Testrun/iCLIP.test_run --new-site-id RBP

                               $$\
                               $$ |
  $$$$$$\   $$$$$$\   $$$$$$\  $$ |  $$\
 $$  __$$\ $$  __$$\  \____$$\ $$ | $$  |
 $$ /  $$ |$$$$$$$$ | $$$$$$$ |$$$$$$  /
 $$ |  $$ |$$   ____|$$  __$$ |$$  _$$<
 $$$$$$$  |\$$$$$$$\ \$$$$$$$ |$$ | \$$\
 $$  ____/  \_______| \_______|\__|  \__|
 $$ |                                $$\
 $$ |                                $$ |
 $$$$$$$\   $$$$$$\   $$$$$$\   $$$$$$$ |
 $$  __$$\ $$  __$$\ $$  __$$\ $$  __$$ |
 $$ |  $$ |$$ /  $$ |$$ /  $$ |$$ /  $$ |
 $$ |  $$ |$$ |  $$ |$$ |  $$ |$$ |  $$ |
 $$ |  $$ |\$$$$$$  |\$$$$$$  |\$$$$$$$ |
 \__|  \__| \______/  \______/  \_______|

Running for you in EXTRACT mode ... 
Get chromosome IDs from --gen ... 
# --in sites pre-filtering:      56482
# --in sites > --max-len:        46
# --in sites non-std chrom IDs:  42028
# --in sites post-filtering:     14444
Check minus-strand exon order in --gtf ... 
Extract transcript regions from --gtf ... 
Get transcript regions overlapping with --in sites ... 
Transcript regions overlapping with --in sites:  3311
--in sites overlapping with transcript regions:  14444
Get --in sites not overlapping with transcript regions ... 
--in sites not overlapping with transcript regions:  0
Extract unique exon regions from transcript regions ... 
Get --in regions overlapping with exon regions ... 
Get genomic site sequences ... 
Traceback (most recent call last):
  File "<miniconda3>/envs/peakhood/bin/peakhood", line 5770, in <module>
    main_extract(args)
  File "<miniconda3>/envs/peakhood/bin/peakhood", line 1140, in main_extract
    site_seqs_dic = hoodlib.get_extended_gen_seqs(args, id2row_dic,
  File "<miniconda3>/envs/peakhood/lib/python3.8/site-packages/peakhood/hoodlib.py", line 6157, in get_extended_gen_seqs
    bed_extract_sequences_from_2bit(zero_sc_tmp_bed,
  File "<miniconda3>/envs/peakhood/lib/python3.8/site-packages/peakhood/hoodlib.py", line 240, in bed_extract_sequences_from_2bit
    assert error == False, "twoBitToFa is complaining:\n%s\n%s" %(check_cmd, output)
AssertionError: twoBitToFa is complaining:
twoBitToFa -bed=Testrun/iCLIP.test_run/7b3406ea-4077-11ec-bcd9-1418773e4b8d.zero_sc.tmp.bed data/resources/dm6.2bit Testrun/iCLIP.test_run/7b341324-4077-11ec-bcd9-1418773e4b8d.zero_sc.tmp.fa
chr4 is not in data/resources/dm6.2bit

michauhl commented 2 years ago

Hi Michael,

hmm, sounds to me as if the dm6.2bit does not have the "chr4" chromosome. You can check by outputting all chromosome IDs in the 2bit file to a file by:

twoBitInfo dm6.2bit chr_info.out

I checked the dm6.2bit from here, and this one contains the "chr4", so you could try this.

Let me know if there are any more problems. We did not have the chance yet to test Peakhood on other datasets (e.g. iCLIP) and peak calling pipelines. So far only eCLIP data and CLIPper peaks ... Will be interesting to know how it works there, and if we can do something to improve it.

Best, Michael :)

michauhl commented 2 years ago

You could also try --read-pos-mode 1 (default), in my tests this worked better than --read-pos-mode 2 (but could be different due to protocol or peak caller).

michauhl commented 2 years ago

Also it looks like a lot of your input BED sites get filtered out due to non-standard chromosome ID. I assume this is due to the chromosome names in dm6, which are different from hg38 or mm10. E.g. "chr2L", "chr2R" .. I'm pretty sure these will be removed too (have to update the filtering method for this). Also there is no "chr1" ... Do you know if the following are the Drosophila standard chromosomes:

chr2L chr2R chr3L chr3R chr4 chrM

Unfortunately chromsome names are always a bit of a struggle due to different naming conventions ...

mirax87 commented 2 years ago

Hi Michael,

thank you for your quick response.

The dm6.2bit you are using is a correct version, however in our in-house setup, we discard the chr prefix and rerun the 2bit generation (see output below). This has the effect, that all my input data to peakhood extract come without the 'chr' prefix and thus, I surprising to have it pop up in the error message.

Further, I use --read-pos-mode 2 to make sure that the 5' end of the iCLIP read is taken into account (single-end reads), because that's the truncated and informative cross-link site.

In general, I am very happy with the iCount pipeline and it seems straight forward to plug it in to your program, as it produces STAR alignments (BAM), calls peaks (single-nucleotide) and clusters them (regions, peak clusters are my input to peakhood).

best, -Michael

Below are the chromosome names for iCLIP.bed, dm6.2bit and dm6_ensembl96.filtered.gtf. The dm6.2bit is a derivative of the original dm6.fasta, which was used for the STAR alignment of the iCLIP data.

$ cat data/peaks/iCLIP.bed | cut -f 1 | sort | uniq -c 
   8532 2L
   9848 2R
  11140 3L
  12508 3R
   2284 4
  12170 X

$ twoBitInfo data/resources/dm6.2bit /dev/stdout | head
2L  23513712
2R  25286936
3L  28110227
3R  32079331
4   1348131
X   23542271
Y   3667352
dmel_mitochondrion_genome   19517
Unmapped_Scaffold_8 88768
3Cen_mapped_Scaffold_31 87365
[...]

$ cut -f 1 data/resources/dm6_ensembl96.filtered.gtf | grep -v '^#'  | sort | uniq -c
  91770 2L
 104547 2R
  95377 3L
 119388 3R
   7542 4
  92148 X
$

michauhl commented 2 years ago

Okay I see. So far Peakhood converts GTF and BED input files to "chr" format, which is what we had in the test BAM + 2BIT files as well. This is just to ensure all IDs are compatible + to prevent crashes and weird results. I have added support for the Drosophila chromosome IDs (So far on branch phv04). For you to make it work you will also need to set --chr-id-style 2 (new option), otherwise Peakhood by default will convert your GTF+BED to "chr1" style, but your BAM+2BIT are "1" style, and then it will complain about incompatible chromosome IDs.

Regarding the --read-pos-mode option: I think it's okay to use --read-pos-mode 1 in your case too, since this option refers to the coverage calculation, not the peak regions (there you have crosslink positions I assume, but Peakhood does not change these by default). Whole reads can be informative, e.g. when they connect two exons (intron spanning reads), where it makes sense to count them twice.

If you have some test files for your "1" style case (BAM BED 2BIT GTF), or some links where to get some, I could also test whether Peakhood works with the new option.

mirax87 commented 2 years ago

That was a fast fix. :-) Thank you.

Good news: It moved good step further! Bad news: it crashes at a later stage.

Here is what I did:

I installed the phv04 branch (the installation doc is great!) and re-ran peakhood extract adding the --chr-id-style 2.

The output

a) reported no more mismatches in the stats part (I consider this a success) and
b) failed with the complaint that --bam and --gen file contain incompatible chroms or order of chroms.

For b) I checked the BAM header and the .2bit chroms and the sets are identical in content and order (using bash diff, not shown).

I suspect that conversion step takes place.

For more details, check the log below.

Here is the output:


$ cat batch.log 

                               $$\
                               $$ |
  $$$$$$\   $$$$$$\   $$$$$$\  $$ |  $$\
 $$  __$$\ $$  __$$\  \____$$\ $$ | $$  |
 $$ /  $$ |$$$$$$$$ | $$$$$$$ |$$$$$$  /
 $$ |  $$ |$$   ____|$$  __$$ |$$  _$$<
 $$$$$$$  |\$$$$$$$\ \$$$$$$$ |$$ | \$$\
 $$  ____/  \_______| \_______|\__|  \__|
 $$ |                                $$\
 $$ |                                $$ |
 $$$$$$$\   $$$$$$\   $$$$$$\   $$$$$$$ |
 $$  __$$\ $$  __$$\ $$  __$$\ $$  __$$ |
 $$ |  $$ |$$ /  $$ |$$ /  $$ |$$ /  $$ |
 $$ |  $$ |$$ |  $$ |$$ |  $$ |$$ |  $$ |
 $$ |  $$ |\$$$$$$  |\$$$$$$  |\$$$$$$$ |
 \__|  \__| \______/  \______/  \_______|

Running for you in EXTRACT mode ... 
Get chromosome IDs from --gen ... 
# --in sites pre-filtering:      56482
# --in sites > --max-len:        46
# --in sites post-filtering:     56436
Check minus-strand exon order in --gtf ... 
Extract transcript regions from --gtf ... 
Get transcript regions overlapping with --in sites ... 
Transcript regions overlapping with --in sites:  14732
--in sites overlapping with transcript regions:  56436
Get --in sites not overlapping with transcript regions ... 
--in sites not overlapping with transcript regions:  0
Extract unique exon regions from transcript regions ... 
Get --in regions overlapping with exon regions ... 
Get genomic site sequences ... 
Processing --bam files ... 
Filter --bam file to keep only R1 reads ... 
Traceback (most recent call last):
  File "<miniconda3>/envs/peakhood/bin/peakhood", line 5833, in <module>
    main_extract(args)
  File "<miniconda3>/envs/peakhood/bin/peakhood", line 1290, in main_extract
    assert chr_id in chr_len_dic, "first chromosome ID \"%s\" in --bam file not in --gen .2bit file. Please provide compatible --bam and --gen files (i.e., format of chromosome IDs should be the same)" %(chr_id)
AssertionError: first chromosome ID "" in --bam file not in --gen .2bit file. Please provide compatible --bam and --gen files (i.e., format of chromosome IDs should be the same)

michauhl commented 2 years ago

Great thanks for the feedback :) Hmm okay .. it could be that the R1 read filtering doesn't work for your BAM file (i.e., no more reads left after filtering). I fixed some more things (on phv04), also to throw some more informative errors, maybe it already works now for you :) If not you can also check whether you still get reads after running

samtools view -hb -f 66 your.bam

on your BAM file (R1 read filtering, same command Peakhood uses). This should be the correct way for iCLIP ..

mirax87 commented 2 years ago

Hi Michael,

sorry for coming back to you with so much delay. Other projects came into focus...

The thing with the iCLIP library, that I am working with is that it's a single-end sequencing library. Since the flag 66 checks for 'proper pairs', it returns no reads.

Installation, command and log are below.

best

To update the peakhood installation, I followed the instructions from git clone ... in https://github.com/BackofenLab/Peakhood#manual-installation

Below are execution command and log

Command:

peakhood extract --chr-id-style 2  --bam-pp-mode 3 --read-pos-mode 2 --new-site-id RBP --report
--in data/peaks/iCLIP.bed --bam data/bam_files/iCLIP.bam --gtf data/resources/dm6_ensembl96.filtered.gtf 
--gen data/resources/dm6.2bit 
--out Testrun/iCLIP.test_run

LOG:

$ bash batch.sh 

                               $$\
                               $$ |
  $$$$$$\   $$$$$$\   $$$$$$\  $$ |  $$\
 $$  __$$\ $$  __$$\  \____$$\ $$ | $$  |
 $$ /  $$ |$$$$$$$$ | $$$$$$$ |$$$$$$  /
 $$ |  $$ |$$   ____|$$  __$$ |$$  _$$<
 $$$$$$$  |\$$$$$$$\ \$$$$$$$ |$$ | \$$\
 $$  ____/  \_______| \_______|\__|  \__|
 $$ |                                $$\
 $$ |                                $$ |
 $$$$$$$\   $$$$$$\   $$$$$$\   $$$$$$$ |
 $$  __$$\ $$  __$$\ $$  __$$\ $$  __$$ |
 $$ |  $$ |$$ /  $$ |$$ /  $$ |$$ /  $$ |
 $$ |  $$ |$$ |  $$ |$$ |  $$ |$$ |  $$ |
 $$ |  $$ |\$$$$$$  |\$$$$$$  |\$$$$$$$ |
 \__|  \__| \______/  \______/  \_______|

Running for you in EXTRACT mode ... 
Get chromosome IDs from --gen ... 
# --in sites pre-filtering:      56482
# --in sites > --max-len:        46
# --in sites post-filtering:     56436
Check minus-strand exon order in --gtf ... 
Extract transcript regions from --gtf ... 
Get transcript regions overlapping with --in sites ... 
Transcript regions overlapping with --in sites:  14732
--in sites overlapping with transcript regions:  56436
Get --in sites not overlapping with transcript regions ... 
--in sites not overlapping with transcript regions:  0
Extract unique exon regions from transcript regions ... 
Get --in regions overlapping with exon regions ... 
Get genomic site sequences ... 
Processing --bam files ... 
Filter --bam file to keep only R1 reads ... 
Traceback (most recent call last):
  File "/localenv/rauer/miniconda3/envs/peakhood/bin/peakhood", line 5839, in <module>
    main_extract(args)
  File "/localenv/rauer/miniconda3/envs/peakhood/bin/peakhood", line 1289, in main_extract
    assert bam_chr_ids_dic, "no chromsome IDs found in BAM file. Possibly the set --bam-pp-mode resulted in an empty BAM file. Filtering your BAM file with \"samtools view -hb -f 130\" (--bam-pp-mode 2, to obtain R2 reads) or \"samtools view -hb -f 66\" (--bam-pp-mode 3, to obtain R1 reads) should produce a non-empty file"
AssertionError: no chromsome IDs found in BAM file. Possibly the set --bam-pp-mode resulted in an empty BAM file. Filtering your BAM file with "samtools view -hb -f 130" (--bam-pp-mode 2, to obtain R2 reads) or "samtools view -hb -f 66" (--bam-pp-mode 3, to obtain R1 reads) should produce a non-empty file

michauhl commented 2 years ago

Hi Michael,

thanks for the feedback! Sorry somehow I didn't get notifications on issues here, should work now .. Yes, the log looks like from the latest version. I see, okay. I never tested it with single-end data. Maybe you can just use --bam-pp-mode 1 (default)? This way there should be no read filtering applied. If it still crashes, let me know, I'm curious :) Once this works I could generate a new release.

Best, Michael

mirax87 commented 2 years ago

Hi Michael,

sorry for the delayed answer - I went on longer holidays.

Now, I tried the --bam-pp-mode 1 and it fails on a different level now, something related to 'exon borders'. Not sure, where it is coming from, though. See error log below.

best

Error log:


                               $$\
                               $$ |
  $$$$$$\   $$$$$$\   $$$$$$\  $$ |  $$\
 $$  __$$\ $$  __$$\  \____$$\ $$ | $$  |
 $$ /  $$ |$$$$$$$$ | $$$$$$$ |$$$$$$  /
 $$ |  $$ |$$   ____|$$  __$$ |$$  _$$<
 $$$$$$$  |\$$$$$$$\ \$$$$$$$ |$$ | \$$\
 $$  ____/  \_______| \_______|\__|  \__|
 $$ |                                $$\
 $$ |                                $$ |
 $$$$$$$\   $$$$$$\   $$$$$$\   $$$$$$$ |
 $$  __$$\ $$  __$$\ $$  __$$\ $$  __$$ |
 $$ |  $$ |$$ /  $$ |$$ /  $$ |$$ /  $$ |
 $$ |  $$ |$$ |  $$ |$$ |  $$ |$$ |  $$ |
 $$ |  $$ |\$$$$$$  |\$$$$$$  |\$$$$$$$ |
 \__|  \__| \______/  \______/  \_______|

Running for you in EXTRACT mode ... 
Get chromosome IDs from --gen ... 
# --in sites pre-filtering:      56482
# --in sites > --max-len:        46
# --in sites post-filtering:     56436
Check minus-strand exon order in --gtf ... 
Extract transcript regions from --gtf ... 
Get transcript regions overlapping with --in sites ... 
Transcript regions overlapping with --in sites:  14732
--in sites overlapping with transcript regions:  56436
Get --in sites not overlapping with transcript regions ... 
--in sites not overlapping with transcript regions:  0
Extract unique exon regions from transcript regions ... 
Get --in regions overlapping with exon regions ... 
Get genomic site sequences ... 
Processing --bam files ... 
Get intronic sites for TIS filtering ... 
Get --in regions overlapping with intron regions ... 
# of intronic sites for TIS filtering:  21636
Merge transcript regions BED ... 
# of exonic sites:                34473
# transcripts with exonic sites:  13769
Extract BAM reads from transcript regions containing exonic sites ... 
# of overlapping BAM reads:  25701534
Get intron-spanning read stats ... 
Get isolated single exon genes ... 
# of isolated single exon genes:  499
Remove low-evidence overlapping genes ... 
# of overlapping genes to remove:  208
# of associated transcripts:       233
Get exon-intron coverage of exonic site transcripts ... 
Concatenate ISR reads BED to transcript reads BED ... 
Output transcript regions in 12-column BED format ... 
Get neighboring exon ISR counts ... 
Remove exons fully overlapping with ISR containing introns ... 
# of NEXT IDs removed due to full overlap with ISR introns:  650
Check transcript context for each exonic site ... 
# exonic sites assigned to genomic context (TRS filter):      6510
# exonic sites assigned to genomic context (EIR WT filter):   10528
# exonic sites assigned to genomic context (EIBR WT filter):  1956
# exonic sites assigned to genomic context (EIR filter):      7597
# exonic sites assigned to genomic context (EIBR filter):     187
# exonic sites assigned to genomic context (TIS filter):      2716
# of overlapping transcripts > --min-ei-ratio:    2680
# of exonic sites (assigned transcript context):  4979
# of exonic sites (assigned genome context):      29494
Get adjacent sites at exon borders ... 
Traceback (most recent call last):
  File "<miniconda3>/envs/peakhood/bin/peakhood", line 5839, in <module>
    main_extract(args)
  File "<miniconda3>/envs/peakhood/bin/peakhood", line 2407, in main_extract
    id2ids_dic = hoodlib.rem_exb_pairs_exb_dist(id2ids_dic, id2exids_dic,
  File "<miniconda3>/envs/peakhood/lib/python3.8/site-packages/peakhood/hoodlib.py", line 483, in rem_exb_pairs_exb_dist
    cp_diff = id2gen_cp_dic[sid1] - id2gen_cp_dic[sid2]
KeyError: 'RBP_33782'

michauhl commented 2 years ago

Hi Michael, thanks for the feedback, and sorry I was still not getting notifications! I assigned myself to the issue, this must be it ... I pushed some small bug fix on phv04 branch (I assume the error comes from this). If there are still errors, as usual let me know! :) Also if possible you can send me the files you're testing with. So that I can run some more tests to find any other hidden bugs.

Best, Michael