FelixKrueger / IMPLICON

A processing guide for IMPLICON data (bisulfite amplicon data for imprinted loci)
GNU General Public License v3.0
2 stars 0 forks source link

filtering CpG context files generates empty file #1

Closed paulocaldas closed 1 year ago

paulocaldas commented 1 year ago

Hi Felix,

I download a few samples from the implication paper (mouse non allelic specific) and I manage to reproduce all the steps in your pipeline, except for the last one. This command:

./filter_coordinates_mouse_not_allele_specific.py CpG_imprinted_positions_mouse.txt

generates a empty file (only with the header).

I running the script inside the folder containing the outputs from the deduplication step, and the script identifies all the files: 'CpG_OT_SRR11207804.sra_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.sam.gz.txt.gz', 'CpG_OB_SRR11207802.sra_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.sam.gz.txt.gz', (...)

but then the methylation_state_consistency.txt is empty image

any ideas what the problem might be?

thanks a lot in advance!

FelixKrueger commented 1 year ago

The problem usually has do do with the regular expression that tries to identify relevant parts of the input files. If you could find this line:

https://github.com/FelixKrueger/IMPLICON/blob/c826e7e2d75404f28f42b1325986f1d028769dc7/filter_coordinates_mouse_not_allele_specific.py#L120

and change it to the structure your files it might just work. Maybe something like this:


# CpG_OT_SRR11207804.sra_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.sam.gz.txt.gz
pattern = 'CpG_.+_(.*).sra_*'
paulocaldas commented 1 year ago

I changed the pattern according to my file names (they are a bit different from above now), but still doesn't work:

# CpG_OT_SRR11207825_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.txt.gz
pattern = 'CpG_.+_(.*)_.*'

In any case, I'm a bit confuse with the rationale. I assumed that once the script recognizes all files according to the pattern "CpG_*.txt.gz" , everything else would work ...

FelixKrueger commented 1 year ago

Hmm, so are you saying that the script runs now fine without any issues, but the file methylation_consistency.txt remains empty?

Is there a chance that you used a genome from USCD or NCBI, and a the annotations from Ensembl? This might be a reason why it doesn't find anything.

Would you be able to share one or two CpG files with me to take a look?

paulocaldas commented 1 year ago

The script was actually running before changing the pattern name according to my files, since it can recognize the right files in the folder with general pattern "CpG_*.txt.gz".

I always use genomes and annotations (human and mouse) from Gencode https://www.gencodegenes.org/. But along the pipeline we never have to use an annotation file as input ... the "filter_coordinates_mouse_not_allele_specific.py" script takes all the CpG files and the "CpG_imprinted_positions_mouse.txt" you provide as input.

The only thing I can see different is that your input file has the first letter in "Chr" as uppercase.

CpG_OB_SRR11207820_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.txt.gz CpG_OB_SRR11207825_1_8bp_UMI_R1_val_1_bismark_bt2_pe.deduplicated.txt.gz

FelixKrueger commented 1 year ago

Ah, that's it. The annotation file uses the chromosome names from Ensembl (e.g. 2, 4, X, MT), but yours use the NCBI way, which would be chr2, chr4, chrX and chrM. Here hare the first few lines of your data:

Bismark methylation extractor version v0.24.0
SRR11207820.43_MS7_22842:1:1101:2262:12686/1:GGGGGGGG_length=150:TAAAAAAA   -   chr5    146951513   z
SRR11207820.43_MS7_22842:1:1101:2262:12686/1:GGGGGGGG_length=150:TAAAAAAA   +   chr5    146951470   Z
SRR11207820.43_MS7_22842:1:1101:2262:12686/1:GGGGGGGG_length=150:TAAAAAAA   -   chr5    146951330   z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA   +   chr20   44395178    Z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA   +   chr20   44395173    Z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA   +   chr20   44395145    Z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA   -   chr20   44395093    z
SRR11207820.280_MS7_22842:1:1101:3512:8678/1:GGGGGGGG_length=150:TAAAAAAA   +   chr20   44395063    Z
SRR11207820.831_MS7_22842:1:1101:5392:21599/1:GTGTTAGG_length=150:CTCTATTA  -   chr8    93741493    z
SRR11207820.831_MS7_22842:1:1101:5392:21599/1:GTGTTAGG_length=150:CTCTATTA  -   chr8    93741482    z
SRR11207820.831_MS7_22842:1:1101:5392:21599/1:GTGTTAGG_length=150:CTCTATTA  -   chr8    93741471    z

and here are the first few lines of the annotation file (column 2 is the relevant one):

Probe   Chromosome  Start   End Feature
Chr2:152686786-152686786    2   152686786   152686786   H13
Chr2:152686810-152686810    2   152686810   152686810   H13
Chr2:152686813-152686813    2   152686813   152686813   H13
Chr2:152686844-152686844    2   152686844   152686844   H13

The solution therefore is to either rename all chromosome names in the annotations to start with chr, or delete it from your CpG files. Either can be done outside, or within the Python script itself. I hope it'll work now!

paulocaldas commented 1 year ago

Yes! That was the issue. Also tested the R script for plotting and it's also working.

Thanks!