adoebley / Griffin

A flexible framework for nucleosome profiling of cell-free DNA

Running Griffin on TSS (follow-up) #14

Closed Irfanwustl closed 1 month ago

Irfanwustl commented 1 year ago

Hi, (this is a follow-up to this issue, which is now closed.) Thank you so much for the fix. It is almost working, but I am now facing two issues.

Issue 1: For some TSSs, I get an error: Intel MKL ERROR: Parameter 6 was incorrect on entry to DGELSD. I have attached one such region: SYN3.txt. The error can be reproduced using the demo file from here.

Issue 2: If I supply only TSSs that do not trigger the error above, I run into another issue. In the generated TSV result file, every row has the same site name, namely the one given in the sites.yaml file. (Since I am providing one file for all TSSs, I listed only one site name in the YAML file.) I have attached such a result for 10 TSSs: it has 10 rows, but the site name is the same in each. Here is the result file: Healthy_demo.GC_corrected.coverage.txt. And the corresponding sites.yaml: sites.txt.

I think I am missing something. Will you please have a look?

Best, Irfan

GlenRoarke commented 1 year ago

Hi Irfan,

Thank you for raising your issues and questions around TSS; they have been really helpful to me and the team, who have had similar questions. I am also attempting to run Griffin using the config_TSS.yaml. It would be great to agree on an example of the correct sites.txt file format to use. Here is an example TSS file I used: GATA1_1.h38.TSS.txt

The columns Gene_id & Griffin_site are used outside of griffin and are not required.

I was wondering whether it was necessary to add in the start and end position columns, these were missing from my TSS database source?

Best wishes,

Glen Roarke

yamawada commented 1 year ago

Hi @adoebley, I am facing the same error. I used the following config.yaml (with modified paths and names) and site.txt:

config: https://github.com/adoebley/Griffin/blob/development/snakemakes/griffin_nucleosome_profiling/config/config_TSS.yaml
site.txt: TSS.ensembl.104.txt

Judging from the error messages, it seems to be caused by the same problem that Irfan reported. I would appreciate it if you could tell me how to deal with it.

adoebley commented 2 months ago

Hi, Thank you for the question and sorry for the huge delay in responding to this!

For Issue 1: "Intel MKL ERROR: Parameter 6 was incorrect on entry to DGELSD."

This seems to be related to how the Savitzky-Golay filter handles NaN values close to the end of the array. I'm not sure why it changed, but I did figure out how to fix it. When setting up the conda environment, run this additional line of code (I've added this to the wiki tutorial):

conda install nomkl numpy scipy

I found this solution here: https://github.com/etal/cnvkit/issues/508

If needed, here is some minimal code to reproduce the error:

import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng()
data = rng.standard_normal(100)
savgol_filter(data, window_length=11, polyorder=3)  # does not cause an error
data[95] = np.nan  # add a NaN near the end of the array
savgol_filter(data, window_length=11, polyorder=3)  # causes the error if nomkl isn't installed

I will get back to you on issue 2 shortly!

adoebley commented 2 months ago

For issue 2: Site names are the same in every row

For individual sites, griffin outputs the chromosome, position, and strand of each site but doesn't keep any additional metadata from the input file (like the name of the individual site). So what you are seeing is the expected behavior. To get the site names back, you'll need to merge the output with your sites file on chromosome, position, and strand (if used).
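
A merge like this could look roughly like the following sketch. It uses only the Python standard library and rests on several assumptions that Griffin does not confirm: both files are tab-separated, the key columns in both are named Chrom, position, and Strand (the config defaults), and the sites file has a hypothetical Gene_id column holding the site name. The tiny tables, including the site_name column, are made up for illustration. Adjust the names to match your files.

```python
import csv
import io

# Assumed key columns shared by the sites file and the Griffin output.
KEY_COLUMNS = ("Chrom", "position", "Strand")


def read_tsv(handle):
    """Read a tab-separated table into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def add_site_names(result_rows, site_rows, name_column="Gene_id"):
    """Attach the site name to each result row by matching on KEY_COLUMNS.

    Values are compared as strings, so positions must be written the same
    way in both files (e.g. "1000" will not match "1000.0").
    """
    names = {
        tuple(row[col] for col in KEY_COLUMNS): row[name_column]
        for row in site_rows
    }
    merged = []
    for row in result_rows:
        key = tuple(row[col] for col in KEY_COLUMNS)
        # Rows without a matching site get an empty name instead of failing.
        merged.append({**row, name_column: names.get(key, "")})
    return merged


# Made-up tables standing in for the real sites file and result file.
sites_text = (
    "Chrom\tposition\tStrand\tGene_id\n"
    "chr1\t1000\t+\tGENE_A\n"
    "chr2\t2000\t-\tGENE_B\n"
)
results_text = (
    "site_name\tChrom\tposition\tStrand\n"
    "TSS\tchr2\t2000\t-\n"
    "TSS\tchr1\t1000\t+\n"
)

merged_rows = add_site_names(
    read_tsv(io.StringIO(results_text)),
    read_tsv(io.StringIO(sites_text)),
)
for row in merged_rows:
    print(row["Chrom"], row["position"], row["Strand"], row["Gene_id"])
```

If you already work in pandas, the same join is a left merge of the results table with the sites table on those key columns.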

At some point I'd like to update griffin to keep all the input columns in the output but currently I haven't implemented that feature. The initial version of griffin was intended for analyzing average sites and when I added an option to keep individual sites I didn't make the necessary modifications to the output format.

Hope this is helpful, let me know if you have further questions on this!

adoebley commented 2 months ago

Hi @GlenRoarke,

To answer your question: "I was wondering whether it was necessary to add in the start and end position columns, these were missing from my TSS database source?"

Griffin only requires two columns specifying the chromosome and position (plus an optional strand column if the sites are directional).

The names for these columns are specified in config.yaml. The defaults are "Chrom", "position" and "Strand", but these can be changed if your sites file uses different names. For instance, if your sites file has columns "Chr" and "TSS" rather than "Chrom" and "position", you can specify these as follows:

chrom_column: Chr
position_column: TSS
strand_column: Strand

All other columns in the sites file are ignored.
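
As an illustration of that column mapping (this is not Griffin's own code), the sketch below reads a sites table using column names taken from a config-like dictionary and ignores every other column. The table contents, the Gene_id column, the tab-separated layout, and the load_sites helper are all assumptions made for the example.

```python
import csv
import io

# Mirrors the three config.yaml entries shown above.
config = {
    "chrom_column": "Chr",
    "position_column": "TSS",
    "strand_column": "Strand",
}


def load_sites(handle, config):
    """Return (chrom, position, strand) tuples using the configured column names."""
    sites = []
    for row in csv.DictReader(handle, delimiter="\t"):
        sites.append((
            row[config["chrom_column"]],
            int(row[config["position_column"]]),
            row[config["strand_column"]],
        ))
    return sites


# Made-up sites table; the extra Gene_id column is simply never read.
sites_text = (
    "Gene_id\tChr\tTSS\tStrand\n"
    "GENE_A\tchr1\t1000\t+\n"
    "GENE_B\tchr2\t2000\t-\n"
)
sites = load_sites(io.StringIO(sites_text), config)
print(sites)
```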

Hope this is helpful, let me know if you have any other questions!