Closed paulocaldas closed 1 year ago
The problem usually has do do with the regular expression that tries to identify relevant parts of the input files. If you could find this line:
and change it to the structure your files it might just work. Maybe something like this:
# CpG_OT_SRR11207804.sra_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.sam.gz.txt.gz
pattern = 'CpG_.+_(.*).sra_*'
I changed the pattern according to my file names (they are a bit different from above now), but still doesn't work:
# CpG_OT_SRR11207825_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.txt.gz
pattern = 'CpG_.+_(.*)_.*'
In any case, I'm a bit confuse with the rationale. I assumed that once the script recognizes all files according to the pattern "CpG_*.txt.gz" , everything else would work ...
Hmm, so are you saying that the script runs now fine without any issues, but the file methylation_consistency.txt
remains empty?
Is there a chance that you used a genome from USCD or NCBI, and a the annotations from Ensembl? This might be a reason why it doesn't find anything.
Would you be able to share one or two CpG files with me to take a look?
The script was actually running before changing the pattern name according to my files, since it can recognize the right files in the folder with general pattern "CpG_*.txt.gz".
I always use genomes and annotations (human and mouse) from Gencode https://www.gencodegenes.org/. But along the pipeline we never have to use an annotation file as input ... the "filter_coordinates_mouse_not_allele_specific.py" script takes all the CpG files and the "CpG_imprinted_positions_mouse.txt" you provide as input.
The only thing I can see different is that your input file has the first letter in "Chr" as uppercase.
CpG_OB_SRR11207820_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.txt.gz CpG_OB_SRR11207825_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.txt.gz
Ah, that's it. The annotation file uses the chromosome names from Ensembl (e.g. 2
, 4
, X
, MT
), but yours use the NCBI way, which would be chr2
, chr4
, chrX
and chrM
.
Here hare the first few lines of your data:
Bismark methylation extractor version v0.24.0
SRR11207820.43_MS7_22842:1:1101:2262:12686/1:GGGGGGGG_length=150:TAAAAAAA - chr5 146951513 z
SRR11207820.43_MS7_22842:1:1101:2262:12686/1:GGGGGGGG_length=150:TAAAAAAA + chr5 146951470 Z
SRR11207820.43_MS7_22842:1:1101:2262:12686/1:GGGGGGGG_length=150:TAAAAAAA - chr5 146951330 z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA + chr20 44395178 Z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA + chr20 44395173 Z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA + chr20 44395145 Z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA - chr20 44395093 z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA + chr20 44395063 Z
SRR11207820.831_MS7_22842:1:1101:5392:21599/1:GTGTTAGG_length=150:CTCTATTA - chr8 93741493 z
SRR11207820.831_MS7_22842:1:1101:5392:21599/1:GTGTTAGG_length=150:CTCTATTA - chr8 93741482 z
SRR11207820.831_MS7_22842:1:1101:5392:21599/1:GTGTTAGG_length=150:CTCTATTA - chr8 93741471 z
and here are the first few lines of the annotation file (column 2 is the relevant one):
Probe Chromosome Start End Feature
Chr2:152686786-152686786 2 152686786 152686786 H13
Chr2:152686810-152686810 2 152686810 152686810 H13
Chr2:152686813-152686813 2 152686813 152686813 H13
Chr2:152686844-152686844 2 152686844 152686844 H13
The solution therefore is to either rename all chromosome names in the annotations to start with chr
, or delete it from your CpG files. Either can be done outside, or within the Python script itself. I hope it'll work now!
Yes! That was the issue. Also tested the R script for plotting and it's also working.
Thanks!
Hi Felix,
I download a few samples from the implication paper (mouse non allelic specific) and I manage to reproduce all the steps in your pipeline, except for the last one. This command:
./filter_coordinates_mouse_not_allele_specific.py CpG_imprinted_positions_mouse.txt
generates a empty file (only with the header).
I running the script inside the folder containing the outputs from the deduplication step, and the script identifies all the files: 'CpG_OT_SRR11207804.sra_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.sam.gz.txt.gz', 'CpG_OB_SRR11207802.sra_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.sam.gz.txt.gz', (...)
but then the methylation_state_consistency.txt is empty
any ideas what the problem might be?
thanks a lot in advance!