JSuryatenggara / ChIP-AP

An Integrated Analysis Pipeline for Unbiased ChIP seq Analysis
20 stars 11 forks source link

How to point to custom genome #17

Closed H1889 closed 2 months ago

H1889 commented 3 months ago

hello, thank you for making such a fantastic tool. I have already tested it with yeast and it works perfectly. However now I have to use it with my custom genome. I have followed the instructions to generate the files as indicated in the guide. I have placed them in a directory but I get an error:

Checking for suite requirements Please make sure bbduk adapter sequences reference fasta file "adapter.fa" exists, inside folder named "bbmap", inside your genome directory folder Example: [YOUR GENOME DIRECTORY]/bbmap/adapters.fa Please make sure bwa aligner genome reference fasta file "hg38.fa" exists, inside folder named "bwa", inside your genome directory folder Example: [YOUR GENOME DIRECTORY]/bwa/hg38.fa Please make sure GEM default read distribution reference file "Read_Distribution_default.txt" exists, inside folder named "GEM", inside your genome directory folder Example: [YOUR GENOME DIRECTORY]/GEM/Read_Distribution_default.txt Please make sure GEM hg38 chromosome sizes reference file "hg38.chrom.sizes" exists, inside folder named "GEM", inside your genome directory folder Example: [YOUR GENOME DIRECTORY]/GEM/hg38.chrom.sizes Please make sure GEM genome reference FOLDER "hg38_Chr_FASTA" exists (containing multiple individual fasta files for every chromosome), inside folder named "GEM", inside your genome directory folder Example: [YOUR GENOME DIRECTORY]/GEM/hg38_Chr_FASTA/ Exiting suite

I have checked all the required files and they are ok.

My question is: is it necessary to put in the path the name of the custom genome?

My genome is called crassa.fa and this is my command:

chipap.py --mode single --peak narrow --output /media/yomismo/Datos/CHIPSEQ_RNASEQ/CHIPSEQ --setname DARK_CONTROL --genome /media/yomismo/Datos/CHIPSEQ_RNASEQ/CHIPSEQ/CHIP_AP_GENOME/ --custom_setting_table /media/yomismo/Datos/CHIPSEQ_RNASEQ/CHIPSEQ/setting_table.tsv --fcmerge --homer_motif union --meme_motif union --deltemp --thread 32 --run --chipR1 dark_1.fastq.gz dark_2.fastq.gz --ctrlR1 INPUT.fastq.gz

This is my genome folder:

├── bbmap │   └── adapters.fa ├── bwa │   ├── crassa.fa │   ├── crassa.fa.amb │   ├── crassa.fa.ann │   ├── crassa.fa.bwt │   ├── crassa.fa.fai │   ├── crassa.fa.pac │   └── crassa.fa.sa └── GEM ├── crassa_Chr_FASTA │   ├── NC_026501.1.fa │   ├── NC_026502.1.fa │   ├── NC_026503.1.fa │   ├── NC_026504.1.fa │   ├── NC_026505.1.fa │   ├── NC_026506.1.fa │   ├── NC_026507.1.fa │   ├── NC_026614.1.fa │   ├── NW_011929459.1.fa │   ├── NW_011929460.1.fa │   ├── NW_011929461.1.fa │   ├── NW_011929462.1.fa │   ├── NW_011929463.1.fa │   ├── NW_011929464.1.fa │   ├── NW_011929465.1.fa │   ├── NW_011929466.1.fa │   ├── NW_011929467.1.fa │   ├── NW_011929468.1.fa │   ├── NW_011929469.1.fa │   ├── NW_011929470.1.fa │   └── NW_011929471.1.fa ├── crassa.chrom.sizes ├── Read_Distribution_BP.txt ├── Read_Distribution_ChIP-exo.txt ├── Read_Distribution_CLIP.txt └── Read_Distribution_default.txt

4 directories, 34 files

Thanks

H1889 commented 3 months ago

One more thing after running my command it is displayed:

"Now processing: chipap.py --mode single --peak narrow --chipR1 dark_1.fastq.gz dark_2.fastq.gz --ctrlR1 INPUT.fastq.gz --output /media/yomismo/Datos/CHIPSEQ_RNASEQ/CHIPSEQ --setname DARK_CONTROL --ref hg38 --genome /media/yomismo/Datos/CHIPSEQ_RNASEQ/CHIPSEQ/CHIP_AP_GENOME/crassa --custom_setting_table /media/yomismo/Datos/CHIPSEQ_RNASEQ/CHIPSEQ/setting_table.tsv --fcmerge --homer_motif union --meme_motif union --deltemp --thread 32 --run"

So the program is asuming the hg38 human genome as reference

H1889 commented 3 months ago

I have discovered a possible bug in the guide to generate the custom genome folder, I think that besides the 3 folders bwa, GEM and bbmap it is necesary another folder called chrom.sizes whith the file crassa.chrom.sizes

JSuryatenggara commented 3 months ago

Hello, thank you so much for your kind comments. Please give us a few days to diagnose, fix, and test. Will get back to you as soon as possible. Cheers.

H1889 commented 3 months ago

Thank you for your help. I forgot one detail. When chip-ap runs with a custom genome it must also need the genome annotation files to be able to annotate the peaks, how can those files be generated from a GTF or GFF file? In my chip-ap installation I see a folder called:

.../share/homer/data/genomes

where there are the folders for the precalculated genomes dm6, hg19,hg38,mm10,mm9 and sacCer3. An in each of those folders the annotationa files, for instance for sacCer3: sacCer3/ ├── annotations │   ├── basic │   │   ├── centromeres.ann.txt │   │   ├── coding.ann.txt │   │   ├── cpgIsland.ann.txt │   │   ├── exons.ann.txt │   │   ├── gaps.ann.txt │   │   ├── intergenic.ann.txt │   │   ├── introns.ann.txt │   │   ├── miRNA.ann.txt │   │   ├── promoters.ann.txt │   │   ├── protein-coding.ann.txt │   │   ├── pseudo.ann.txt │   │   ├── tts.ann.txt │   │   ├── utr3.ann.txt │   │   └── utr5.ann.txt │   ├── custom │   └── repeats ├── chrI.fa ├── chrII.fa ├── chrIII.fa ├── chrIV.fa ├── chrIX.fa ├── chrM.fa ├── chrom.sizes ├── chrV.fa ├── chrVI.fa ├── chrVII.fa ├── chrVIII.fa ├── chrX.fa ├── chrXI.fa ├── chrXII.fa ├── chrXIII.fa ├── chrXIV.fa ├── chrXV.fa ├── chrXVI.fa ├── preparsed │   ├── sacCer3r.0.pos │   ├── sacCer3r.401.cgbins │   ├── sacCer3r.401.cgfreq │   ├── sacCer3r.401.gcbins │   ├── sacCer3r.401.pos │   ├── sacCer3r.401.seq │   ├── sacCer3r.507.cgbins │   ├── sacCer3r.507.cgfreq │   ├── sacCer3r.507.gcbins │   ├── sacCer3r.507.pos │   ├── sacCer3r.507.seq │   ├── sacCer3r.759.cgbins │   ├── sacCer3r.759.cgfreq │   ├── sacCer3r.759.gcbins │   ├── sacCer3r.759.pos │   ├── sacCer3r.759.seq │   ├── sacCer3r.789.cgbins │   ├── sacCer3r.789.cgfreq │   ├── sacCer3r.789.gcbins │   ├── sacCer3r.789.pos │   ├── sacCer3r.789.seq │   ├── sacCer3r.870.cgbins │   ├── sacCer3r.870.cgfreq │   ├── sacCer3r.870.gcbins │   ├── sacCer3r.870.pos │   ├── sacCer3r.870.seq │   ├── sacCer3r.916.cgbins │   ├── sacCer3r.916.cgfreq │   ├── sacCer3r.916.gcbins │   ├── sacCer3r.916.pos │   └── sacCer3r.916.seq ├── sacCer3.aug ├── sacCer3.basic.annotation ├── sacCer3.full.annotation ├── sacCer3.miRNA ├── sacCer3.repeats ├── sacCer3.rna ├── sacCer3.splice3p ├── sacCer3.splice5p ├── sacCer3.stop ├── sacCer3.tss └── sacCer3.tts

How are those files generated?

Sorry for the inconvenience. best

H1889 commented 3 months ago

I have realized that the annotation files and GO-analysis files are generated by perl scripts from the HOMER package, so, theoretically I can add those files running the scripts

JSuryatenggara commented 2 months ago

Hi, thank you for your detailed bug reports. It really helps us to improve streamlining the process for custom genomes. I have made necessary changes to the scripts to accommodate custom genome like yours, and will potentially be publishing a new ChIP-AP release in one or two days.

Since it is not a widely established reference, I am not sure the N. crassa genome (NC12 as in NCBI RefSeq version) and annotation that I got my hands on is identical to what you got over there. Therefore, for now, please download and take a look at the attached file below. Can you help to fill in (or edit, to be precise) the "chromosome" names (line 86 & 215) and their lengths (line 215). To easily check the lengths, you can use the command "faidx filename.fa -i chromsizes". Additionally, you may also edit the genome name in that file from NC12 to crassa, as that is how you name your genome reference.

GenomeData.py.gz

Once you've done that, please resend it back to me so I can test if ChIP-AP is able to process everything from start to finish.

PS: BTW, have you successfully created the necessary annotation files using the perl scripts from HOMER?

H1889 commented 2 months ago

Thank you, very,very much for your help. The chromosome names and sizes are ok, they are the same I am using. So I have not edited the 86 and 215 lines. Regarding if I have been able to get the HOMER files the answer is no, it gave me an error, surely I have not done something right. If I get it I will let you know. Regards

JSuryatenggara commented 2 months ago

crassa.zip

Hi, attached are the resources you will need to run with your ncrassa custom genome.

  1. If you are running with --peak broad, then ChIP-AP will be using SICER2. For that, you will need to replace existing GenomeData.py file in your Anaconda installation with the updated one (attached). You can use: find . -name "*anaconda*GenomeData.py" to find all instances of that file. If it is kind of confusing to be sure which one to replace, you can simply replace all instances of it since all it does is updating the list and should not break anything.
  2. Please take a look at my screenshots of my custom genome folder to help you confirm that everything is fine. Emphasis on the file naming in GEM's chromosome-wise folder, since GEM behaves weirdly when the chromosome naming format is unconventional. To easily produce that, you can use the updated chromfa_splitter.py (attached) like this: chromfa_splitter.py --fa [genome folder]/bwa/NC12.fa --output [genome folder]/GEM/NC12_Chr_FASTA/ --gem.
  3. I got HOMER to load custom NC12 genome and install its ncrassa accessions and ontology data for ChIP-AP to run everything up the the very end. What you need to do is to go into the directory where loadGenome.pl is located and run for example: ./loadGenome.pl -name NC12 -fasta [genome folder]/bwa/NC12.fa -gtf [path]/NC12.ncbiRefSeq.gtf -org ncrassa. Next, you need to go to the directory where ./configureHomer.pl is located and run: perl ./configureHomer.pl -install ncrassa.
  4. Finally, replace your chipap.py with the updated chipap.py (attached)
  5. With all that, you can now try to rerun your whole analysis by calling chipap.py with --ref NC12.
  6. When you have done that, please do let me know how it turns out. I will be releasing a new version when there is no more bug encountered.

Cheers! Thank you for helping us improve ChIP-AP!

H1889 commented 2 months ago

Thank you very much! I will try it tomorrow and let you know how it went. thank you again for making our lives easier

best

H1889 commented 2 months ago

I have followed your instructions, however it was stopped by this error:

I do not understand why it says bwa is aligning paired-end, my data are single-end, this was my command:

chipap.py --mode single --peak narrow --chipR1 dark_1.fastq.gz dark_2.fastq.gz --ctrlR1 INPUT.fastq.gz --output /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ --setname DARK_CONTROL --ref NC12 --genome /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/CHIP_AP_GENOME --custom_setting_table /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/setting_table.tsv --fcmerge --goann --homer_motif union --meme_motif union --deltemp --thread 32 --run

Thanks

H1889 commented 2 months ago

I have detected the error, my NC12.fa.pac file was named as NC12fa.pac, anyway, why bwa says it is aligning paired-end?

Now is running again. Hope this time all goes well. Best

H1889 commented 2 months ago

It was stooped by a new error: Opening file: /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/22_peaks_processing/DARK_CONTROL_all_peaks_annotated.tsv Running the calculator on ChIP replicate DARK_CONTROL_chip_merged and DARK_CONTROL_ctrl_merged chip_uniquely_mapped_read_count: 9467721 ctrl_uniquely_mapped_read_count: 42678599 chip_normalization_factor_value: 0.22183767091323686 ctrl_normalization_factor_value: 1 Traceback (most recent call last): File "/media/yomismo/Datos/chipap_installation/chipap_scripts/fold_change_calculator.py", line 485, in 'GC%' File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/pandas/core/frame.py", line 3030, in getitem indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1] File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing) File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/pandas/core/indexing.py", line 1316, in _validate_read_indexer raise KeyError(f"{not_found} not in index") KeyError: "['GC%', 'CpG%'] not in index"

Any suggestion? thanks and sorry for the inconvenience

JSuryatenggara commented 2 months ago

Hello, thanks for running the pipeline and reaching back.

The paired-end notification is a mistake in my script, so don't worry about it because bwa is actually running single-end in your case. Replace your chipap.py with the one in the attachment and that trivial problem will be no more.

As for the GC% and CpG% error, replace your fold_change_calculator.py with the one in the attachment. As far as I tested, it should resolve your issue above.

crassa_2.zip

Hopefully it goes well! Please let me know how it turns up later.

H1889 commented 2 months ago

A new error at meme:

meme_sequence_extractor.py: error: argument --ref: invalid choice: 'NC12' (choose from 'hg19', 'hg38', 'mm9', 'mm10', 'dm6', 'sacCer3')

H1889 commented 2 months ago

Hi again, have you resolved the new error at meme? Thanks and sorry for the inconvenience

JSuryatenggara commented 2 months ago

Sorry, I missed the notification from your comment 5 days ago. Please replace your meme_sequence_extractor.py with the one from the attachment. Please feel free to ask further questions.

crassa_3.zip

Cheers!

H1889 commented 2 months ago

don't worry, you are already making enough effort. I will try it and let you know the results, thanks a million.

H1889 commented 2 months ago

Hi again, the run has ended as follows:

Generating background sequences for target sequences replicate 1...

Background sequences (replicate 1) has been generated: /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/25_meme_motif_enrichment/union_peak_set/background.fa Background sequences (replicate 1) stats: Number of sequences: 7559 Length of sequences: 1105 Total G or C characters: 3973966 Total A or T characters: 4378729 Individual background sequence GC content: 0.4758

Is this correct?

Greetings

JSuryatenggara commented 2 months ago

Hi, so inside the bottom last folder (25_meme_motif_enrichment), can you find a folder named motif_analysis inside your union_peak_set folder? If yes, can you find meme-chip.html inside that folder? If yes, can you see the top motif candidates when you open that file? If you can for all three things above, then the pipeline has completed successfully.

Let me know if anything still did not work or if any of the resulting outputs looks weird to you.

Cheers!

H1889 commented 2 months ago

Thank you very much. The answer is yes, I can see all those files and folders and the motifs. it seems everything is now ok. I will patiently check all the results and write you if I detect anything anomalous.

All the best

H1889 commented 2 months ago

Hi. I have checked all the folders and files and seems everything is ok. Only the GO columns are empty. Homer has no data for N. crassa linking annotation and go-terms. A detail that can improve the app is to compute the motifs in the consensus mode with the maximum of methods, i.e, in my data GEM detects no peaks but if I run in mode consensus no result is shown given that one of the methods failed: MACS2 GEM HOMER Genrich Total Name X 7145 MACS2 X X 4 MACS2|Genrich X X 361 MACS2|HOMER X X X 48 MACS2|HOMER|Genrich

in that case my motif prediction must be based on the 48 common peaks or give the possibility to select 2, 3 or 4 methods to compute motif prediction. Evenmore, you can give the option to compute peaks given a minimum FC and/or minium -logIDR. Thanks

JSuryatenggara commented 2 months ago

Hi, thanks for the feedback.

Can you recheck whether GEM returned an empty peak list (which is not so rare if peaks are rather hard to detect in general), or it somehow crashes during peak calling (which sometimes happen since it is the most resource-demanding peak caller among others)? You can check by looking at the two log files.

Looking at the number of called peaks, there is a high chance that only MACS2 managed to call peaks properly in your dataset. Since the union peak set is basically equal to your MACS2 peak set, I would check if it the motif enrichment analysis did call the expected motif (if you have a motif reference for your target).

As for doing the motif enrichment analysis with 2 or 3 peak callers, it is in our plan to implement in the future. However, with each step being carried out by modular scripts, you can easily do it yourself by opening "24_homer_motif_enrichment_consensus_script.sh" and changing the number 4 in "awk 'NR == 1 ; $7 == 4'" into 2 or 3 and rerun. That should do the trick for you. You can also generate multiple results for both 2 and 3 peak caller overlaps by simply changing the naming of the files.

If you want to filter with FC or IDR or practically anything else, you can simply open the xxxxx_all_peaks_calculated.tsv file in folder 22, use a spreadsheet application like MS Excel or LibreOffice Calc to freely filter the peaks you want, then remove all other columns except column 1 to 5, save it, and feed it directly to findMotifsGenome.pl (see "24_homer_motif_enrichment_consensus_script.sh" for an example; remember to modify the filenames as needed)

Now for meme, you can open "25_meme_motif_enrichment_consensus_script.sh" and change the flag argument "--filter 4" into 2 or 3 and rerun. Same thing, you can also generate multiple results. Just play around with the file naming.

If you want to filter with FC or IDR or practically anything else, you can simply open the xxxxx_all_peaks_calculated.tsv file in folder 22, use a spreadsheet application like MS Excel or LibreOffice Calc to freely filter the peaks you want,, save it, and feed it to meme_sequence_extractor.py while removing the flag argument "--filter 4" (see "25_meme_motif_enrichment_consensus_script.sh" for an example; remember to modify the filenames as needed).

Hope this helps! Cheers!

H1889 commented 2 months ago

Perfect explanation, I will do as you said. Respect to GEM this is the err log:

" Warning: The read distribution is not updated, too few (499<500) significant events.

Warning: The read distribution is not updated, too few (496<500) significant events.

chrNC_026614.1.fa file is not in correct FASTA format."

An this is the out log:

"Welcome to GEM (version 2.7)!

Please cite: Yuchun Guo, Shaun Mahony, David K. Gifford (2012) PLoS Computational Biology 8(8): e1002638. High Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints. doi:10.1371/journal.pcbi.1002638

Gifford Laboratory at MIT (http://cgs.csail.mit.edu/gem/).


Start time: 2024/07/02 14:10:20

Loading data... Loading reads from: DARK_CONTROL_chip_rep1.bam ... Loaded Loading reads from: DARK_CONTROL_chip_rep2.bam ... Loaded Loading reads from: DARK_CONTROL_ctrl_rep1.bam ... Loaded

Options: -Xmx10G --k_min 8 --k_max 12 --t 16 --d /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/CHIP_AP_GENOME/GEM/Read_Distribution_default.txt --g /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/CHIP_AP_GENOME/GEM/NC12.chrom.sizes --genome /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/CHIP_AP_GENOME/GEM/NC12_Chr_FASTA --expt /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/08_results/DARK_CONTROL_chip_rep1.bam --expt /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/08_results/DARK_CONTROL_chip_rep2.bam --ctrl /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/08_results/DARK_CONTROL_ctrl_rep1.bam --f SAM --out /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/12_gem_peak_calling/DARK_CONTROL_GEM

Mappable Genome Length is 32.881.902.

Getting 5' positions of all reads...

Original read count stats: _IP Bases: 8223078 HitCounts: 9467707.0 _CTRL Bases: 35228650 HitCounts: 4.2678598E7

Filter duplicate reads. _IP Bases: 8223078 HitCounts: 9121074.0 _CTRL Bases: 35228650 HitCounts: 4.2283689E7

At Poisson p-value 1,0e-03, in a 501bp window, expect 177 reads.

IP: 9121074 Ctrl: 42283689   IP/Ctrl: 0,22

Sorting reads and selecting enriched regions ... For condition , IP/Control = 0,01

The genome is segmented into 43461 regions for analysis.

============================ Round 0 ============================

Running with 16 threads ...

100 /43461 56,1s 1000 /43461 1,1m 10000 /43461 1,3m 20000 /43461 1,3m 43461 /43461 1,3m

Scaling condition , IP/Control = 0,20


Events discovered Significant: 499 Insignificant: 30 Filtered: 47

Total 280 homotypic events (within 501bp of other events).

Finish binding event prediction: 1,3m

============================ Round 1 ============================

Running with 16 threads ...

100 /43461 1,0s 1000 /43461 2,0s 10000 /43461 4,0s 20000 /43461 7,0s 30000 /43461 10,0s 40000 /43461 14,0s 43461 /43461 18,0s

Events discovered Significant: 496 Insignificant: 27 Filtered: 46

Total 280 homotypic events (within 501bp of other events).

Finish binding event prediction: 18,1s

Loading genome sequences ..."

the *_GEM_GPS_events.txt file has 497 lines, these a re the top 10:

osition IP Control Fold Expectd Q-lg10 P-lg10 P_poiss IPvsEMP Noise KmerGroup MotifId KG_hgp Strand NW_011929459.1:94803 35723,6 449,1 79,5 749,1 999,00 999,00 999,00 -0,42 0,28 -------- -1 0.00 NW_011929459.1:96953 32570,1 654,6 49,8 749,1 999,00 999,00 999,00 -0,59 0,67 -------- -1 0.00 NW_011929459.1:96590 28584,5 958,2 29,8 749,1 999,00 999,00 999,00 -0,40 0,67 -------- -1 0.00 NW_011929459.1:97513 26108,7 753,3 34,7 749,1 999,00 999,00 999,00 -0,38 0,67 -------- -1 0.00 NW_011929459.1:94205 17442,8 410,9 42,4 749,1 999,00 999,00 999,00 -0,61 0,33 -------- -1 0.00 NW_011929459.1:97180 16655,7 397,9 41,9 749,1 999,00 999,00 999,00 -0,72 0,67 -------- -1 0.00 NW_011929459.1:93959 14044,4 281,0 50,0 749,1 999,00 999,00 999,00 -0,63 0,33 -------- -1 0.00 NW_011929459.1:97329 13835,9 307,2 45,0 749,1 999,00 999,00 999,00 -0,70 0,67 -------- -1 0.00 NW_011929459.1:94975 13285,0 674,6 19,7 749,1 999,00 999,00 999,00 -0,52 0,28 -------- -1 0.00 NW_011929459.1:94375 10101,5 887,9 11,4 749,1 999,00 999,00 999,00 -0,64 0,33 -------- -1 0.00

An these the bottom 10: NC_026505.1:2319556 462,4 55,0 8,4 389,0 3,82 80,99 3,85 -0,50 0,00 -------- -1 0.00 NC_026505.1:5847216 501,3 66,8 7,5 428,6 3,50 82,95 3,53 -0,62 0,00 -------- -1 0.00 NC_026502.1:3018407 251,6 71,1 3,5 201,6 3,44 24,13 3,46 -0,58 0,00 -------- -1 0.00 NC_026501.1:2234890 248,0 80,7 3,1 198,6 3,37 20,42 3,39 -0,51 0,00 -------- -1 0.00 NC_026503.1:4608131 258,6 47,9 5,4 209,8 3,22 35,72 3,24 -0,67 0,00 -------- -1 0.00 NC_026502.1:2089670 390,1 44,1 8,8 332,4 3,00 69,52 3,02 -0,50 0,00 -------- -1 0.00 NC_026505.1:5715699 294,0 107,3 2,7 244,3 2,94 20,57 2,96 -0,46 0,00 -------- -1 0.00 NC_026501.1:5354740 314,0 59,2 5,3 265,8 2,73 42,44 2,75 -0,57 0,00 -------- -1 0.00 NC_026504.1:2182132 498,0 106,7 4,7 435,6 2,71 60,79 2,74 -0,22 0,60 -------- -1 0.00 NC_026507.1:3908647 373,0 74,4 5,0 324,2 2,34 48,17 2,37 -0,41 0,00 -------- -1 0.00 "

I have run 5 different experimental conditions of N. crassa and none gave GEM peaks. The computation resources are not a problem, my workstation has 32 cpus (64 threads) and 512Gb of RAM.

Thanks again for your help and involvement

JSuryatenggara commented 2 months ago

It seems the problem is caused by your chrNC_026614.1.fa file. Can you try and see if it begins like below:

>NC_026614.1 GGATCCACTCGTAGAGTGGAATCACAGTTCTGTGGAGGCACGGCTATCCTATAGCGCGTCTTGCGACTGCGACGGAGCAG CTGGATTAGTACTGGACGGATCATAATCATCAACCTGGGTAGTTTAAAGGTAAAACCTTAATTTCATACATTAAAGATGA GAATTCGATTTTCTCCCCAGGTTCGCGTGCGCACTATGGAGGGGAAGGGCACTGATCGAGTAGGACCAGAGCTTTCCTTA TTTAAATCTCCAAATTTCTTCAAAATTTATGATAATAATGACTATACTGTCTTTATTACTTACTAATGCCGTTACTTTAA GACGAGATATTTCCATACTTTTTAATAGAATTGTGATTATAGCATTAATTTATTGTATTTTGCATGATACAATGAGTTTA TCTATTATAAGTAAGGGGGTAGGTTTACATGGAGGTTTACTTCATATAACCAATATAACACAAGTATTCCAAATATTTAT

H1889 commented 2 months ago

Correct! The file has not the head >NC_026614.1 and none of the *.fa files of the GEM folder has the head. I will add the hedas manually and re-run the commands and tell you. Thnaks!

JSuryatenggara commented 2 months ago

If you used the chromfa_splitter.py like below:

chromfa_splitter.py --fa [genome folder]/bwa/NC12.fa --output [genome folder]/GEM/NC12_Chr_FASTA/ --gem

Then you should have them all correct, supposedly. I believe I sent you an updated chromfa_splitter.py in one of the attachment above. I tested it and it should give you what you need.

H1889 commented 2 months ago

I have GEM peaks now! However the GEM_IDR_output.tsv at folder 22 is empty. The GEM_IDR_input.narrowpeak seems ok.

Thease are the first lines of the input file:

chrNC_026507.1 1765596 1765646 chrNC_026507.1:1765596-1765646 0 . 135,9 -1 -1 -1 chrNW_011929459.1 94778 94828 chrNW_011929459.1:94778-94828 0 . 79,1 -1 -1 -1 chrNC_026503.1 1585442 1585492 chrNC_026503.1:1585442-1585492 0 . 74,1 -1 -1 -1 chrNC_026505.1 679271 679321 chrNC_026505.1:679271-679321 0 . 69,7 -1 -1 -1 chrNC_026501.1 1837395 1837445 chrNC_026501.1:1837395-1837445 0 . 55,6 -1 -1 -1 chrNW_011929459.1 93935 93985 chrNW_011929459.1:93935-93985 0 . 50,0 -1 -1 -1 chrNW_011929459.1 96939 96989 chrNW_011929459.1:96939-96989 0 . 49,3 -1 -1 -1 chrNC_026505.1 303710 303760 chrNC_026505.1:303710-303760 0 . 46,6 -1 -1 -1 chrNC_026505.1 679769 679819 chrNC_026505.1:679769-679819 0 . 45,4 -1 -1 -1

probably this is the source of the error:

"Peak --rank signal.value --use-nonoverlapping-peaks /home/yomismo/mambaforge/envs/CHIP-AP/bin/idr -s /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/22_peaks_processing/DARK_CONTROL_all_peaks_calculated.narrowPeak /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/22_peaks_processing/IDR_files/GEM_IDR_input.narrowPeak -o /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/22_peaks_processing/IDR_files/GEM_IDR_output.tsv --input-file-type narrowPeak --rank signal.value --use-nonoverlapping-peaks Traceback (most recent call last): File "/home/yomismo/mambaforge/envs/CHIP-AP/bin/idr", line 10, in idr.idr.main() File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/idr/idr.py", line 840, in main merged_peaks, signal_type = load_samples(args) File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/idr/idr.py", line 704, in load_samples for fp in args.samples] File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/idr/idr.py", line 704, in for fp in args.samples] File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/idr/idr.py", line 54, in load_bed signal = float(data[signal_index]) ValueError: could not convert string to float: '135,9'"

and later:

"Opening file: /media/yomismo/Datos/VOS_CHIPSEQ_RNASEQ/CHIPSEQ/DARK_CONTROL/22_peaks_processing/IDR_files/GEM_IDR_output.tsv Traceback (most recent call last): File "/media/yomismo/Datos/chipap_installation/chipap_scripts/IDR_integrator.py", line 111, in peak_idr_df = pd.read_csv(idr_tsv_full_path, delimiter = '\t', header = None) File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/pandas/io/parsers.py", line 605, in read_csv return _read(filepath_or_buffer, kwds) File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read parser = TextFileReader(filepath_or_buffer, kwds) File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/pandas/io/parsers.py", line 814, in init self._engine = self._make_engine(self.engine) File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/pandas/io/parsers.py", line 1045, in _make_engine return mapping[engine](self.f, self.options) # type: ignore[call-arg] File "/home/yomismo/mambaforge/envs/CHIP-AP/lib/python3.7/site-packages/pandas/io/parsers.py", line 1893, in init self._reader = parsers.TextReader(self.handles.handle, **kwds) File "pandas/_libs/parsers.pyx", line 521, in pandas._libs.parsers.TextReader.cinit pandas.errors.EmptyDataError: No columns to parse from file"

Thank you very much. Greetings

JSuryatenggara commented 2 months ago

The problem is exactly this line:

ValueError: could not convert string to float: '135,9'"

For some weird reason, the decimal separators are all commas instead of periods in that .narrowPeak file. From all the GEM outputs you pasted above, it looks like your GEM uses comma as a decimal separator (and period as a thousand separator, at that). This is the first time I saw something like this. Do you have any idea how that happened? Does your computer, by chance, has comma as a decimal separator and period as a thousand separator, as a system default?

H1889 commented 2 months ago

I have checked my decimal and thousands separators with locale: locale thousands_sep , locale decimal_point .

So the system is ok. The input file *_IDR_input.narrowPeak of MACS2, HOMER and Genrich all have decimal with ".". Only the one from GEM has decimals with ",".

The GEM narrowpeaks input file is generated from the event.txt file and that file is generated with "," as decimal, so de problem is focused in the GEM script at step 12. I realized that my yeast data I analysed with chip-ap three monts ago did not have GEM peaks, for that reason I did not see the "," decimal error.

Probably the problem is:

1-python script generating GEM chromosome files was wrong. 2-No GEM peaks were generated. 3-The decimal error can not be exposed. 4-When the GEM chromosome fasta files are correct, GEM gives peaks 5-When GEM peaks are availabe the hidden decimal error is exposed.

I am now checking the command generating GEM peaks as a candidate source for the error.

What is your oppinion? Do you think I am right?

best

H1889 commented 2 months ago

I have solved the problem of decimals. Despite my system uses decimals with ".", GEM runs with Java that does not use the system configurtion, Java uses the language configuration, so I changed my language to american english and the problem was solved. Sorry for the inconvenience. Best

H1889 commented 2 months ago

You can close the issue, I ran all the samples and everything went fine. Thank you very,very for your help Best

JSuryatenggara commented 2 months ago

Good to hear! I am closing this thread. Don't hesitate to start a new issue if you encounter any more problem.