angieyen / ChromDiff

ChromDiff program as described in Yen and Kellis, Nature Communications 2015.
http://compbio.mit.edu/ChromDiff
GNU General Public License v3.0
11 stars 2 forks source link

Errors from featnames_bgvals.sh script #3

Closed aayushraman closed 7 years ago

aayushraman commented 8 years ago

Hi Angela,

I hope you are doing good. I have a couple of questions regarding your ChromDiff pipeline. I using my dataset for running ChromDiff. I am using segments files obtained from ChromHMM output for calculation of the features in ChromDiff pipeline (Step 2).

I would like to know why are you doing following tests in your featnames_bgvals.sh script:

> numerrs_perc = `awk '($1!="NA"){sum=0; i=1; while(i<=NF){sum+=$i; i+=1;}; if (sum<.99999) {print NR, sum}}' perc/${curr_label}/numsonly/*_numsonly.txt | wc -l`
> if [ $numerrs_perc -ne 0 ]; then echo "$numerrs_perc errors in $curr_label files" ; exit 1; fi

My run is getting aborted and it says that I have: 179415 errors in core_gencode_v10 files I am not sure why I am getting this error. Can you please help me understand the reasons behind the error ? Also, can you please explain me the reasoning behind the test ?

angieyen commented 7 years ago

Hi Ayush,

I assume you may have moved on or figured it out already, but I thought I would explain it just for completion. That script is calculating the percent of each gene that is labeled as each chromatin state. Therefore, the sum of those values for each gene should be equal to (or nearly equal) to 1. That test ensures that the percentage values look as they should before moving forward with the analysis.

If it is throwing an error, it looks like there may be a problem with your files. Do your segment files give a single chromatin state for every 200 bp region of the genome? If not, (if they only have calls in some parts of the genome), that could cause an error. Do your file formats match the specifications listed in the README file?

Did you try running the command and viewing the results? For example:

awk '($1!="NA"){sum=0; i=1; while(i<=NF){sum+=$i; i+=1;}; if (sum<.99999) {print NR, sum}}' perc/core_gencode_v10/nums_only/*_numsonly.txt > temp_err.txt

Then you can look at the tmp_err.txt file and check for any errors.

Ideally, the file would be empty but the error message you are getting suggests that there are 179,000+ lines, where each line is showing the line number of the original file and the sum of the percentages it found. Again, we want the sum to be 1 so you should be seeing numbers less than 1. Perhaps something like

1 .7

That would mean that in line 1, for that gene, all the percentage values only added up to .7 (70%) rather than 1 (100%). Again, this should likely be an error with your input file values or file format, or perhaps a previous run of calculating the percentage values got aborted or cut off early?

Good luck! Angela

rtmag commented 6 years ago

Hello Angela. I'm having the exact same error. I've checked my Chromatin state annotations files and the look fine.

Could this be an issue with the format of the Gene IDs and locations file?