UcarLab / PEAS

Code repository for PEAS (Predict Enhancers from ATAC-seq), including feature extraction files and easy to use python script for training enhancer models and predicting enhancers using MLP Neural Networks.
MIT License
9 stars 2 forks source link

methods to produce promoterModel.plk #2

Open El-Castor opened 5 years ago

El-Castor commented 5 years ago

Hi,

I'm trying to used PEAS on my own ATACseq data. I have already extract all the feature and performed a model using chromHMM.

My issue is that, the chromHMM output is a .txt file and it's seems different to your promoterModel.plk file when I verfify the content with pickle.

So my first question is, how do you produced this .plk file? and my second question is, when you have do the promoterModel using chromHMM do you have jsut take your ATACseq data only or you complete them with differents epigenetics marks ?

Thanks in advance

clement

ajt986 commented 5 years ago

Hello!

The .pkl files are generated from creating a model using PEASTrainer.py (The Java GUI uses this file in the backend). Models were trained using features obtained directly from the feature extraction method and appending one additional column for the class label for each peak. In our model training we decided on a two-step process that first predicts promoters and producing a .txt annotated feature file (1 for promoter, 0 for enhancers/other annotations). Next, the enhancer are predicted by excluding the promoters and running the generated .txt file through the enhancer .pkl model.

The classes.txt file encodes information about the class labels with three columns in tab delimited format. (column index starting from 0, the annotation label, the numeric class label starting from 0 (values less than 0 are ignored and excluded from training)). If you're trying to train your own model and have your own annotations, you will need to append this class label information to the feature file and provide your own classes.txt file. You should still be able to use the other meta data files (i.e., features.txt and labelencoder.txt) as long as the feature file format remains the same (aside from the appended class label column).

To give an outline of the process:

ATAC-seq data (.bam file) -> Feature Extraction (.txt file) -> Promoter Prediction (.txt file with an additional column of 0 and 1s where 1 indicates that the row is a promoter) -> Enhancer Prediction (.txt file with 0 and 1s where 1 indicates the row is an enhancer)

El-Castor commented 5 years ago

Hi !

First of all, thanks for your fast and complete response !

Sorry I have not understand something. So you right I'm trying to build my own model using your software (more precisly PEAStrainer.py), but the things is that I don't understand how you get this input file : "CD4T_features.txt" because when I check my features file (obtained after use the script extractionFeature.py) I have not the same file. (see below the two file exemple)

your file CD4T_feature.txt :

Chr Start End Peak Score Peak Length Fold Change Summit Pileup Summit Center Distance # of all inserts Insert Mean Long/Short Insert Ratio # of all inserts (0,50] # of all inserts (50,150] # of all inserts (150,300] # of all inserts (300,500] # of all inserts (500,0) # of cuts within peak # of overrepresented cuts Conservation (Mean) GC% CpG% Annotation (HOMER) Distance to TSS Gene Type Known Motif% Denovo Motif% # of CTCF Motifs chr1 10073 10326 26.06927 254 12.77589 24 0.4251968503937008 61 63.26229508196721 0.12962962962962962 0.5901639344262295 0.3114754098360656 0.09836065573770492 0.0 0.0 108 2.0 0.0 0.525692 0.0 Intergenic -1674 pseudo 6.234413965087282E-4 0.08695652173913043 0 chr1 16210 16271 10.42849 62 6.64346 12 -0.03225806451612903 14 89.92857142857142 0.16666666666666666 0.35714285714285715 0.5 0.14285714285714285 0.0 0.0 16 0.0 0.998 0.540984 0.016667 intron 1195 ncRNA 0.0024937655860349127 0.043478260869565216 0

my file _feature_anotated.txt (annotated with Annotate feature step of your software) :*

CMiso1.1chr00 628629 628808 trimmed_ATAC_female_2_S2_001_window200_shift_100_extSize_200_peak_536 1338 . 13.00556 137.44374 133.87466 115 Best Annotation Best Annotation Freqeuncy Annotation Breakdown Unique Annotation Count CMiso1.1chr00 632269 632566 trimmed_ATAC_female_2_S2_001_window200_shift_100_extSize_200_peak_537 4253 . 28.49853 429.43326 425.30017 187 O 1 0 CMiso1.1chr00 638556 638781 trimmed_ATAC_female_2_S2_001_window200_shift_100_extSize_200_peak_538 5820 . 35.74753 586.35229 582.00146 73 mRNA:CMiso1.1chr00g0281121 0.5111111111111111 mRNA:CMiso1.1chr00g0281121(0.5111111111111111) 1 CMiso1.1chr00 1016767 1016929 trimmed_ATAC_female_2_S2_001_window200_shift_100_extSize_200_peak_850 481 . 7.03580 51.07163 48.15739 81 mRNA:CMiso1.1chr00g0281471 0.9938271604938271 mRNA:CMiso1.1chr00g0281471(0.9938271604938271) 1

So my question is How I can get the same file as you to produce my model.plk file?Can you tell me how do you parse your features file that you use to produce the training model please ?

More over, I have not understand two things concerning the argument needed by PEAStrainer.py script: First, concerning the classe.txt file, how/were you define the second columns ? what does it means for exemple AP, OP or GE? And second, the first column (that as 27 as value) correpesond to which columns file ?

To recapitulate, I have do this workflow (for the step 1 and 2) and I want to do this (for step 3 to 5) :

1 - extract features from my bam using extractionFeature.py- > allow to obtained theses output files : -trimmed_ATAC_female_2_S2_001_filtered_conservation.txt -trimmed_ATAC_female_2_S2_001_filtered_insertmetrics.txt -trimmed_ATAC_female_2_S2_001_filtered_peaks.filtered -trimmed_ATAC_female_2_S2_001_filtered_peaks_annotated.bed -trimmed_ATAC_female_2_S2_001_filtered_peaks_ctcf.bed -trimmed_ATAC_female_2_S2_001_filtered_peaks_denovo.bed

2- Annotate feature using annotate commande from PEASTools.jar -> allow to obtained this output file : -trimmed_ATAC_female_2_S2_001_filtered_peaks.filtered.annotated.txt

3-Train Model (in progresse)

4- Annotate promoter, using neural network or TSS

5-Predict Enhancers

Can you tell me if it's the right way ?

To finish, I m wondering if it's relevant to produce a model using chromoHMM with OCRs data AND chipseq data. the idea is to use this model with PEAS ?

Thanks in advance,

Best regards

Le mar. 13 août 2019 à 20:41, Asa Thibodeau notifications@github.com a écrit :

Hello!

The .pkl files are generated from creating a model using PEASTrainer.py (The Java GUI uses this file in the backend). Models were trained using features obtained directly from the feature extraction method and appending one additional column for the class label for each peak. In our model training we decided on a two-step process that first predicts promoters and producing a .txt annotated feature file (1 for promoter, 0 for enhancers/other annotations). Next, the enhancer are predicted by excluding the promoters and running the generated .txt file through the enhancer .pkl model.

The classes.txt file encodes information about the class labels with three columns in tab delimited format. (column index starting from 0, the annotation label, the numeric class label starting from 0 (values less than 0 are ignored and excluded from training)). If you're trying to train your own model and have your own annotations, you will need to append this class label information to the feature file and provide your own classes.txt file. You should still be able to use the other meta data files (i.e., features.txt and labelencoder.txt) as long as the feature file format remains the same (aside from the appended class label column).

To give an outline of the process:

ATAC-seq data (.bam file) -> Feature Extraction (.txt file) -> Promoter Prediction (.txt file with an additional column of 0 and 1s where 1 indicates that the row is a promoter) -> Enhancer Prediction (.txt file with 0 and 1s where 1 indicates the row is an enhancer)

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HPZQ3XZX6PLG4ZFSELQEL57NA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4GTBTI#issuecomment-520958157, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HOWQZ2O4DVOTMYDPK3QEL57NANCNFSM4ILMEO6A .

ajt986 commented 5 years ago

I think there is an error during the feature extraction step. The final output should also provide a "_sorted_merged_features.txt" file and will have the same format as the example.

For classes.txt, The 27 means that the 27th column of the annotated feature file has the class label. The second column included string labels such as AP, OP, and GE. These were labels that I used to indicate that ChromHMM annotated these peaks as Active Promoter (AP), Other Promoter (OP), or Genic Enhancer (GE). The third column is simply what these labels will be converted to. For example if the 27th column in the annotated feature file is AP, and we would replace the string label with an integer label such as 0 or 1.

I think the main issue is with the feature extraction step though. The rest of the methods depend on getting the "_sorted_merged_features.txt" output. Were there any errors?

El-Castor commented 5 years ago

I think your right, I have check all my files and the file "*._insertmetrics.txt" is empty, do you have an idea to explain this issue ? I have join the log of the feature extraction command if you can check.

thanks for your response !

Le mer. 14 août 2019 à 16:37, Asa Thibodeau notifications@github.com a écrit :

I think there is an error during the feature extraction step. The final output should also provide a "_sorted_merged_features.txt" file and will have the same format as the example.

For classes.txt, The 27 means that the 27th column of the annotated feature file has the class label. The second column included string labels such as AP, OP, and GE. These were labels that I used to indicate that ChromHMM annotated these peaks as Active Promoter (AP), Other Promoter (OP), or Genic Enhancer (GE). The third column is simply what these labels will be converted to. For example if the 27th column in the annotated feature file is AP, and we would replace the string label with an integer label such as 0 or 1.

I think the main issue is with the feature extraction step though. The rest of the methods depend on getting the "_sorted_merged_features.txt" output. Were there any errors?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HNDLG6XPVTA2OVHABLQEQKDZA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4JAAOI#issuecomment-521273401, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HLKRJNSK4VMHHGEGNLQEQKDZANCNFSM4ILMEO6A .

El-Castor commented 5 years ago

I have find an error in the log but I don't understand it :

Exception in thread "main" java.lang.reflect.InvocationTargetException

at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58) Caused by: java.lang.NoClassDefFoundError: htsjdk/samtools/SamReaderFactory at org.jax.peastools.insertmetrics.PeakThreshold.getThreshold(PeakThreshold.java:44) at org.jax.peastools.insertmetrics.PeakThreshold.doIt(PeakThreshold.java:30) at org.jax.peastools.insertmetrics.PeakThreshold.main(PeakThreshold.java:22) at org.jax.peastools.Main.main(Main.java:34) ... 5 more Caused by: java.lang.ClassNotFoundException: htsjdk.samtools.SamReaderFactory at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) ... 9 more cat: thresh.txt: Aucun fichier ou dossier de ce type Exception in thread "main" java.lang.reflect.InvocationTargetException at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58) Caused by: java.lang.NumberFormatException: For input string: "" at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.base/java.lang.Integer.parseInt(Integer.java:662) at java.base/java.lang.Integer.parseInt(Integer.java:770) at org.jax.peastools.insertmetrics.PeakInsertMetrics.main(PeakInsertMetrics.java:29) at org.jax.peastools.Main.main(Main.java:37) ... 5 more

Le mer. 14 août 2019 à 17:47, clement pichot clementpch@gmail.com a écrit :

I think your right, I have check all my files and the file "*._insertmetrics.txt" is empty, do you have an idea to explain this issue ? I have join the log of the feature extraction command if you can check.

thanks for your response !

Le mer. 14 août 2019 à 16:37, Asa Thibodeau notifications@github.com a écrit :

I think there is an error during the feature extraction step. The final output should also provide a "_sorted_merged_features.txt" file and will have the same format as the example.

For classes.txt, The 27 means that the 27th column of the annotated feature file has the class label. The second column included string labels such as AP, OP, and GE. These were labels that I used to indicate that ChromHMM annotated these peaks as Active Promoter (AP), Other Promoter (OP), or Genic Enhancer (GE). The third column is simply what these labels will be converted to. For example if the 27th column in the annotated feature file is AP, and we would replace the string label with an integer label such as 0 or 1.

I think the main issue is with the feature extraction step though. The rest of the methods depend on getting the "_sorted_merged_features.txt" output. Were there any errors?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HNDLG6XPVTA2OVHABLQEQKDZA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4JAAOI#issuecomment-521273401, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HLKRJNSK4VMHHGEGNLQEQKDZANCNFSM4ILMEO6A .

El-Castor commented 5 years ago

Hi,

I think the issue appears when the script extractFeature.py launch this command :

java -jar "/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_PEAS_data/PEASTools.jar" -jar insertsizethresh trimmed_ATAC_female_2_S2_001_filtered_sorted.bam /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe/peak_features/

thresh=$(cat "thresh.txt")

In fact, I don't find the file thresh.txt, it's the command that produce it or it should be me ?

When I launch this command in the terminal I get nothings, no error :

(PEAS) cpichot@node9:/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe$ java -jar "/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_PEAS_data/PEASTools.jar" -jar insertsizethresh trimmed_ATAC_female_2_S2_001_filtered_sorted.bam /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe/peak_features/ (PEAS) cpichot@node9:/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe$ ls peak_features/

Le mer. 14 août 2019 à 17:48, clement pichot clementpch@gmail.com a écrit :

I have find an error in the log but I don't understand it :

Exception in thread "main" java.lang.reflect.InvocationTargetException

at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58) Caused by: java.lang.NoClassDefFoundError: htsjdk/samtools/SamReaderFactory at org.jax.peastools.insertmetrics.PeakThreshold.getThreshold(PeakThreshold.java:44) at org.jax.peastools.insertmetrics.PeakThreshold.doIt(PeakThreshold.java:30) at org.jax.peastools.insertmetrics.PeakThreshold.main(PeakThreshold.java:22) at org.jax.peastools.Main.main(Main.java:34) ... 5 more Caused by: java.lang.ClassNotFoundException: htsjdk.samtools.SamReaderFactory at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) ... 9 more cat: thresh.txt: Aucun fichier ou dossier de ce type Exception in thread "main" java.lang.reflect.InvocationTargetException at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58) Caused by: java.lang.NumberFormatException: For input string: "" at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.base/java.lang.Integer.parseInt(Integer.java:662) at java.base/java.lang.Integer.parseInt(Integer.java:770) at org.jax.peastools.insertmetrics.PeakInsertMetrics.main(PeakInsertMetrics.java:29) at org.jax.peastools.Main.main(Main.java:37) ... 5 more

Le mer. 14 août 2019 à 17:47, clement pichot clementpch@gmail.com a écrit :

I think your right, I have check all my files and the file "*._insertmetrics.txt" is empty, do you have an idea to explain this issue ? I have join the log of the feature extraction command if you can check.

thanks for your response !

Le mer. 14 août 2019 à 16:37, Asa Thibodeau notifications@github.com a écrit :

I think there is an error during the feature extraction step. The final output should also provide a "_sorted_merged_features.txt" file and will have the same format as the example.

For classes.txt, The 27 means that the 27th column of the annotated feature file has the class label. The second column included string labels such as AP, OP, and GE. These were labels that I used to indicate that ChromHMM annotated these peaks as Active Promoter (AP), Other Promoter (OP), or Genic Enhancer (GE). The third column is simply what these labels will be converted to. For example if the 27th column in the annotated feature file is AP, and we would replace the string label with an integer label such as 0 or 1.

I think the main issue is with the feature extraction step though. The rest of the methods depend on getting the "_sorted_merged_features.txt" output. Were there any errors?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HNDLG6XPVTA2OVHABLQEQKDZA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4JAAOI#issuecomment-521273401, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HLKRJNSK4VMHHGEGNLQEQKDZANCNFSM4ILMEO6A .

ajt986 commented 5 years ago

It looks like the htsJDK class files cannot be found when running the java PEASTools.jar program that would produce the thresh.txt:

" Caused by: java.lang.ClassNotFoundException: htsjdk.samtools.SamReaderFactory"

However, I'm not sure why it would be caused here and not earlier when using the same library for preprocessing the bam file for peak calling with MACS. You should see a number of other bam files with your output if these steps are running correctly. Are you using the PEASTools.jar from the most recent release? I think this may be the issue. I'll look into this file. If you're familiar with build jar files, you could try generating the executable jar file from the source code and setting the htsJDK library configuration.

El-Castor commented 5 years ago

Yes I used the PEAS v1.1 that I have donwload here : https://github.com/UcarLab/PEAS/release

I have check on the PEASTools.jar and I have not find the htsjdk class. It should be there right ?

I'm not very familiar with the building of jar file but I will try to do it. I give you a link to the PEASTools.jar that I use to do the analysis. PEASTools.jar https://drive.google.com/file/d/11NfdTD7Kke7kHGl2cmT7WWdHRy5Zqniv/view?usp=drive_web

Le mer. 14 août 2019 à 21:36, Asa Thibodeau notifications@github.com a écrit :

It looks like the htsJDK class files cannot be found when running the java PEASTools.jar program that would produce the thresh.txt:

" Caused by: java.lang.ClassNotFoundException: htsjdk.samtools.SamReaderFactory"

However, I'm not sure why it would be caused here and not earlier when using the same library for preprocessing the bam file for peak calling with MACS. You should see a number of other bam files with your output if these steps are running correctly. Are you using the PEASTools.jar from the most recent release? I think this may be the issue. I'll look into this file. If you're familiar with build jar files, you could try generating the executable jar file from the source code and setting the htsJDK library configuration.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HMBECIGBGFM7FNJD53QERNFXA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4J37NI#issuecomment-521387957, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HK5OMJSQBY4QYPJQFLQERNFXANCNFSM4ILMEO6A .

ajt986 commented 5 years ago

I've been running some test scripts and couldn't find an issue withe PEASTools.jar on my end. I did compile an alternate version of PEASTools.jar that extracts the library class files instead of the jar files:

https://github.com/UcarLab/PEAS/releases/download/v1.1a/PEASTools.jar.zip

Replace this PEASTools.jar file with the one you currently have to try it out. Hopefully this will correct the issue missing class issue.

El-Castor commented 5 years ago

Hi !

Thanks you very much to build this PEASTools.jar. I've try to use it this morning but I have an another issue when the script begin to produce the insermetric.txt file. I' ve join the complete log error.

The error is :

readline() on closed filehandle IN at

/opt/share/FLOCAD/userspace/cpichot/miniconda3/envs/PEAS/bin/annotatePeaks.pl line 3720. readline() on closed filehandle IN at /opt/share/FLOCAD/userspace/cpichot/miniconda3/envs/PEAS/bin/annotatePeaks.pl line 3730. Outputing Annotation File... Done annotating peaks file

Exception in thread "main" java.lang.NullPointerException at org.jax.peastools.insertmetrics.PeakInsertMetrics.getInsertMetrics(PeakInsertMetrics.java:67) at org.jax.peastools.insertmetrics.PeakInsertMetrics.main(PeakInsertMetrics.java:31) at org.jax.peastools.Main.main(Main.java:37) cat: trimmed_ATAC_female_2_S2_001_filtered_chr1_insertmetrics.txt: Aucun fichier ou dossier de ce type rm: impossible de supprimer « trimmed_ATAC_female_2_S2_001_filtered_chr1_insertmetrics.txt »: Aucun fichier ou dossier de ce type Exception in thread "main" java.lang.NullPointerException at org.jax.peastools.insertmetrics.PeakInsertMetrics.getInsertMetrics(PeakInsertMetrics.java:67) at org.jax.peastools.insertmetrics.PeakInsertMetrics.main(PeakInsertMetrics.java:31) at org.jax.peastools.Main.main(Main.java:37)

What do you think ?

Thanks a lot to help me

Best regards

Le mer. 14 août 2019 à 22:25, Asa Thibodeau notifications@github.com a écrit :

I've been running some test scripts and couldn't find an issue withe PEASTools.jar on my end. I did compile an alternate version of PEASTools.jar that extracts the library class files instead of the jar files:

https://github.com/UcarLab/PEAS/releases/download/v1.1a/PEASTools.jar.zip

Replace this PEASTools.jar file with the one you currently have to try it out. Hopefully this will correct the issue missing class issue.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HLKJLHEDKVRVPQF3UDQERS3BA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4KACXQ#issuecomment-521404766, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HORJ5TT6T6L7GCI373QERS3BANCNFSM4ILMEO6A .

El-Castor commented 5 years ago

I have check the output (peaks_feature) and I see that my file peaks_filtered is empty and not my narrowPeak file. Do you think that the filter that you use to filtered the narrawPeak are to high for my data ? I should change the qValue cutoff ? I don't understand why I have no peaks because I have no error from macs2. I've join a capture of my output and the log file if you want to check.

Le jeu. 15 août 2019 à 10:40, clement pichot clementpch@gmail.com a écrit :

Hi !

Thanks you very much to build this PEASTools.jar. I've try to use it this morning but I have an another issue when the script begin to produce the insermetric.txt file. I' ve join the complete log error.

The error is :

readline() on closed filehandle IN at

/opt/share/FLOCAD/userspace/cpichot/miniconda3/envs/PEAS/bin/annotatePeaks.pl line 3720. readline() on closed filehandle IN at /opt/share/FLOCAD/userspace/cpichot/miniconda3/envs/PEAS/bin/annotatePeaks.pl line 3730. Outputing Annotation File... Done annotating peaks file

Exception in thread "main" java.lang.NullPointerException at org.jax.peastools.insertmetrics.PeakInsertMetrics.getInsertMetrics(PeakInsertMetrics.java:67) at org.jax.peastools.insertmetrics.PeakInsertMetrics.main(PeakInsertMetrics.java:31) at org.jax.peastools.Main.main(Main.java:37) cat: trimmed_ATAC_female_2_S2_001_filtered_chr1_insertmetrics.txt: Aucun fichier ou dossier de ce type rm: impossible de supprimer « trimmed_ATAC_female_2_S2_001_filtered_chr1_insertmetrics.txt »: Aucun fichier ou dossier de ce type Exception in thread "main" java.lang.NullPointerException at org.jax.peastools.insertmetrics.PeakInsertMetrics.getInsertMetrics(PeakInsertMetrics.java:67) at org.jax.peastools.insertmetrics.PeakInsertMetrics.main(PeakInsertMetrics.java:31) at org.jax.peastools.Main.main(Main.java:37)

What do you think ?

Thanks a lot to help me

Best regards

Le mer. 14 août 2019 à 22:25, Asa Thibodeau notifications@github.com a écrit :

I've been running some test scripts and couldn't find an issue withe PEASTools.jar on my end. I did compile an alternate version of PEASTools.jar that extracts the library class files instead of the jar files:

https://github.com/UcarLab/PEAS/releases/download/v1.1a/PEASTools.jar.zip

Replace this PEASTools.jar file with the one you currently have to try it out. Hopefully this will correct the issue missing class issue.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HLKJLHEDKVRVPQF3UDQERS3BA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4KACXQ#issuecomment-521404766, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HORJ5TT6T6L7GCI373QERS3BANCNFSM4ILMEO6A .

ajt986 commented 5 years ago

I think there is an issue with chromosome naming. The tool is expecting chromosomes to be named "chr1", "chr2", ... "chr22", and is probably the reason why the filtered peak file is empty.

I noticed from your previous output that you have chromosomes like this: "CMiso1.1chr00"

Basically, what's happening is that the tool is filtering out all of the peaks because they aren't in chromsomes 1-22. The tool was intended for hg19.

It's possible to get around this though by editing the shell script (PEASFeatureExtraction.sh).

Put a # in front of

java -jar "${jarpath}PEASTools.jar" filter "${prefix}_peaks.narrowPeak" "${filterpeaks}" "${prefix}_peaks.filtered"

So it looks something like this:

#Filter error prone regions
echo "--- Filtering peaks. ---"
#java -jar "${jarpath}PEASTools.jar" filter "${prefix}_peaks.narrowPeak" "${filterpeaks}" "${prefix}_peaks.filtered"

This will remove the peak filtering step, retaining the peaks.

Next, you will need to identify all the bam file chromosomes. The bam files for each chromosome may still be in the output directory. You should see the sorted bam file split by chromosome.

Instead of the following code:

#Get Insert features
echo "--- Getting insert features. ---"
for i in {1..22}
do
    chr=chr$i
    java -jar "${jarpath}PEASTools.jar" insertmetrics "$chr" "${chr}.bam" "${prefix}_peaks.filtered" "${prefix}_${chr}_insertmetrics.txt" "$thresh"
    rm ${chr}.bam
    cat ${prefix}_${chr}_insertmetrics.txt >> ${prefix}_insertmetrics.txt
rm "${prefix}_${chr}_insertmetrics.txt"
done

You will need to manually insert the following as a replacement for the above code:

java -jar "${jarpath}PEASTools.jar" insertmetrics "$chr" "${chr}.bam" "${prefix}_peaks.filtered" "${prefix}_${chr}_insertmetrics.txt" "$thresh"

Replacing the $chr and ${chr} variables with the name of the chromosome for each chromosome in the bam file.

It's a little messy but it will allow the tool to work with the chromosomes in your bam file.

ucarduygu commented 5 years ago

I am a bit concerned about the quality of your data. Can you please answer the following before we 'debug' the code?

El-Castor commented 5 years ago

Hi !

concerning your first response I've try your custom command. But it doesn't work yet (I have join the log error file). The issue is this :

Here you have the command that I used, It is the good one ? I mean I have well understand your recommandation ?

Get Insert features

echo "--- Getting insert features. ---"

jarpath="/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_PEAS_data/" prefix="trimmed_ATAC_female_2_S2_001_filtered" thresh=$(cat "thresh.txt")

for i in {00..12} do chr=CMiso1.1chr$i java -jar "${jarpath}PEASTools.jar" insertmetrics "$chr" "${chr}.bam" "${prefix}peaks.filtered" "${prefix}${chr}_insertmetrics.txt" "$thresh"

rm ${chr}.bam

cat ${prefix}_${chr}_insertmetrics.txt >> ${prefix}_insertmetrics.txt

rm "${prefix}_${chr}_insertmetrics.txt"

done

concerning your question about my data, I'm working on Cucumis melo and not in Human . With default value using macs2 I Have in output more or less 30K at 40K of Peaks, depending on the condition. I have now 2 replicate for each condition. The fact is that in this analysis with PEAS, I use the differnetially peak (obtained by my own pipeline) between my two condition (25 000 peaks). I used the genome PacBio of cucumis melo that we produce in my lab ( very great genome quality). The tissue that I study is male and female flowers of cucumis melo at the early stage of devlopment. I hop I have respond to your question, don't hesitate if i'm not clear.

Best regards

Le jeu. 15 août 2019 à 14:47, ucarduygu notifications@github.com a écrit :

I am a bit concerned about the quality of your data. Can you please answer the following before we 'debug' the code?

  • How many peaks do you call with macs2 with default q-value setting?
  • Is this all human data. Which reference genome did you use?
  • What is your tissue?
  • Do you have replicates?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HIY43NYBLVY5BZY5GLQEVF7TA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4LWYHI#issuecomment-521628701, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HMCSRCLYAGVPQ5WODDQEVF7TANCNFSM4ILMEO6A .

ajt986 commented 5 years ago

It looks like it's still working off of the filtered peak file.

Swap out "${prefix}_peaks.filtered" with "${prefix}_peaks.narrowPeak" in the loop:

for i in {00..12} do 
   chr=CMiso1.1chr$i 
   java -jar "${jarpath}PEASTools.jar" insertmetrics "$chr" "${chr}.bam" "${prefix}_peaks.narrowPeak" "${prefix}_${chr}_insertmetrics.txt" "$thresh" 

done

It will work off of the MACS peak calls instead of the filtered one that doesn't have any peaks.

El-Castor commented 5 years ago

Hi !

Sorry I was not in my lab this week.

I don't think it's the solution because my file "${peaks}_filtered" is not empty (I have 24K peaks underlight in it). But I have try your idea:

So here you have the command that I used :

/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_PEAS_data/PEASFeatureExtraction_custom_bypass_macs2_plus_inserMetricCustom.sh

/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/mapping/filtered/ trimmed_ATAC_female_2_S2_001_filtered /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe /K/FLOCAD/DATA/OMICS/Melon/DNAseq/AdnaneB/Genome_PacBio/toulouse_assemblage/CMiso1.1/20180606/CMiso1.1_genome.fa CMiso1.1Melo /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/peaks/trimmed_ATAC_female_2_S2_001_window200_shift_100_extSize_200_peaks.narrowPeak /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/findMotifsHomer/findMotifsGenome_differential_peaks_obtainedWithDA/output/motifs_homer_analysis_motifSize8_peaksDiff_fem2male1/homerMotifs.motifs8 /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/fem2_test1/CMiso1.1_genome.fa.chrom.sizes_fakeConservation.bed /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_PEAS_data/extraction_files/CTCF.motifs /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_PEAS_data/

And the error :

Exception in thread "main" java.lang.NullPointerException

at org.jax.peastools.insertmetrics.PeakInsertMetrics.getInsertMetrics(PeakInsertMetrics.java:67) at org.jax.peastools.insertmetrics.PeakInsertMetrics.main(PeakInsertMetrics.java:31) at org.jax.peastools.Main.main(Main.java:37) cat: trimmed_ATAC_female_2_S2_001_filtered_CMiso1.1chr00_insertmetrics.txt: Aucun fichier ou dossier de ce type

I have join the log error file and a screen shot of the peaks_features folder.

Can you explain how this command work, when the file trimmed_ATAC_female_2_S2_001_filtered_CMiso1.1chr00_insertmetrics.txt is created ? More over I don't understand whythe script produces bam file for 22 chr because If I remenber weel we have fix this no ?

Do you have any idea to fix the problem?

Thanks in advance

Le ven. 16 août 2019 à 14:57, Asa Thibodeau notifications@github.com a écrit :

It looks like it's still working off of the filtered peak file.

Swap out "${prefix}_peaks.filtered" with "${prefix}_peaks.narrowPeak" in the loop:

for i in {00..12} do chr=CMiso1.1chr$i java -jar "${jarpath}PEASTools.jar" insertmetrics "$chr" "${chr}.bam" "${prefix}peaks.narrowPeak" "${prefix}${chr}insertmetrics.txt" "$thresh" #rm ${chr}.bam cat ${prefix}${chr}_insertmetrics.txt >> ${prefix}insertmetrics.txt #rm "${prefix}${chr}_insertmetrics.txt" done

It will work off of the MACS peak calls instead of the filtered one that doesn't have any peaks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HNGA6SQQFRTXVO7XKDQE2P3VA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4ORMGY#issuecomment-521999899, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HKOHU5UZIKPGZ63YZDQE2P3VANCNFSM4ILMEO6A .

ajt986 commented 5 years ago

I forgot that I enforced the chromosomes to be chr1-22 when creating the BAM files to avoid creating files aligned to alternative chromosomes and taking up more space than intended. I've updated the alternative PEASTools.jar in the release section. This will create a bam file for each of the chromosomes present in the bam file. I'm assuming the bam files created for each chromosome are nearly empty on your end?

After running the following line: java -jar "${jarpath}PEASTools.jar" insertsizethresh "${prefix}_sorted.bam" "${outDir}/peak_features"

Bam files for each chromosome should be created. Check to make sure that the filenames of these bam files match the chromosomes in your peak file. If they do, this should fix the error that you are seeing.

El-Castor commented 5 years ago

Hi,

Ok I understand now, yes there are empty. I will try that you said today. I tell you what's going on

Thanks a lot

Le lun. 26 août 2019 à 18:14, Asa Thibodeau notifications@github.com a écrit :

I forgot that I enforced the chromosomes to be chr1-22 when creating the BAM files to avoid creating files aligned to alternative chromosomes and taking up more space than intended. I've updated the alternative PEASTools.jar in the release section. This will create a bam file for each of the chromosomes present in the bam file. I'm assuming the bam files created for each chromosome are nearly empty on your end?

After running the following line: java -jar "${jarpath}PEASTools.jar" insertsizethresh "${prefix}_sorted.bam" "${outDir}/peak_features"

Bam files for each chromosome should be created. Check to make sure that the filenames of these bam files match the chromosomes in your peak file. If they do, this should fix the error that you are seeing.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HKC6QEIAAUJCJTQCKLQGP6QFA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5E3LMQ#issuecomment-524924338, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HNCNBLMD2GMAUGJ2VLQGP6QFANCNFSM4ILMEO6A .

El-Castor commented 5 years ago

Hi, I've download your releas and relanche the custom script as you tell me.

I've this error :

17:24:53.017 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/share/FLOCAD/userspace/cpichot/miniconda3/envs/ATACseq/share/picard-2.17.10-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so [Mon Aug 26 17:24:53 CEST 2019] CollectInsertSizeMetrics HISTOGRAM_FILE=/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe/peak_features/bamChrFile/CMiso1.1chr12.bam_insermetricsHistomgramme.txt INPUT=/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe/peak_features/bamChrFile/CMiso1.1chr12.bam OUTPUT=/K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe/peak_features/bamChrFile/CMiso1.1chr12.bam_insermetrics.txt DEVIATIONS=10.0 MINIMUM_PCT=0.05 METRIC_ACCUMULATION_LEVEL=[ALL_READS] INCLUDE_DUPLICATES=false ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false [Mon Aug 26 17:24:53 CEST 2019] Executing as cpichot@node9 on Linux 3.16.0-10-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Deflater: Intel; Inflater: Intel; Picard version: 2.17.10-SNAPSHOT WARNING 2019-08-26 17:24:53 SinglePassSamProgram File reports sort order 'unsorted', assuming it's coordinate sorted anyway. [Mon Aug 26 17:24:53 CEST 2019] picard.analysis.CollectInsertSizeMetrics done. Elapsed time: 0.00 minutes. Runtime.totalMemory()=514850816 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Empty sequence dictionary.; File /K/FLOCAD/Bioinfo/Workspace/ClementP/DavidL/ATACseq/ATACseq_cytometry_oct2018/machine_learning/test_peaksDiff_customPipe/peak_features/bamChrFile/CMiso1.1chr12.bam; Line 1 Line: NB501040:117:HT2NMBGX7:4:21607:3154:8964 99 CMiso1.1chr12 36 17 76M = 40 80 CCCCTAACCCCGAACCCCTAACCCTGAACCCTGAACCCTGAACCCTCAACCCTCAACCCTGAACCCTCAACCCTCA /AAAAEEEEEEEEEEEEE/EEEEE/A/AAEEAEE6AEEEA//AEEEEEEEEEEEE/EEE/6AAAEAAE//EEE//A MD:Z:76 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:-3 YT:Z:CP at htsjdk.samtools.SAMLineParser.reportErrorParsingLine(SAMLineParser.java:457) at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:355) at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:268) at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:255) at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:228) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:576) at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:548) at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:136) at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:84) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

I don't know why but the bam are not created. Do you have an idea?

Le mar. 27 août 2019 à 11:09, clement pichot clementpch@gmail.com a écrit :

Hi,

Ok I understand now, yes there are empty. I will try that you said today. I tell you what's going on

Thanks a lot

Le lun. 26 août 2019 à 18:14, Asa Thibodeau notifications@github.com a écrit :

I forgot that I enforced the chromosomes to be chr1-22 when creating the BAM files to avoid creating files aligned to alternative chromosomes and taking up more space than intended. I've updated the alternative PEASTools.jar in the release section. This will create a bam file for each of the chromosomes present in the bam file. I'm assuming the bam files created for each chromosome are nearly empty on your end?

After running the following line: java -jar "${jarpath}PEASTools.jar" insertsizethresh "${prefix}_sorted.bam" "${outDir}/peak_features"

Bam files for each chromosome should be created. Check to make sure that the filenames of these bam files match the chromosomes in your peak file. If they do, this should fix the error that you are seeing.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UcarLab/PEAS/issues/2?email_source=notifications&email_token=ABSR3HKC6QEIAAUJCJTQCKLQGP6QFA5CNFSM4ILMEO6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5E3LMQ#issuecomment-524924338, or mute the thread https://github.com/notifications/unsubscribe-auth/ABSR3HNCNBLMD2GMAUGJ2VLQGP6QFANCNFSM4ILMEO6A .

ajt986 commented 5 years ago

Could you please list the unique chromosomes from your peak file as a comment? I'm trying to understand whether or not "CMiso1.1chr12.bam" is a chromosome in your data or if something is just appending "chr12"

Although this line:

SinglePassSamProgram File reports sort order
'unsorted', assuming it's coordinate sorted anyway.

makes me think that the header of the bam file is missing or that the tool is processing an empty bam file.