How do I filter the CGmap files to get the ~5.5 million sites described in the paper?

Hi,

I am trying to create the input data for the TrainPCClocks.R script using the processed data uploaded to GEO (GSE161141), but I am having trouble reproducing the site filtering described in the Rat PCA clock paper. The closest I have gotten is ~4.4 million sites. To get there, I keep sites with coverage >= 10, restrict column 1 to chromosomes 1-20, X, and Y, and then keep sites that pass these filters in at least 80% of samples, using columns 1 and 3 together as the unique identifier for each site's location.
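In case it helps pin down where my approach diverges, here is a minimal Python sketch of the filtering described above. It is only my interpretation, not the paper's method. It assumes the usual 8-column tab-separated CGmap layout with total coverage in column 8, and that chromosome names may or may not carry a `chr` prefix. The function names `covered_sites` and `shared_sites` are hypothetical.

```python
from collections import Counter

# Chromosomes kept: 1-20 plus X and Y, as described above.
KEEP_CHROMS = {str(i) for i in range(1, 21)} | {"X", "Y"}


def covered_sites(lines, min_cov=10):
    """Return the (chrom, pos) keys in one sample that pass both filters.

    Assumed CGmap columns: chrom, nucleotide, position, context,
    dinucleotide context, methylation level, methylated count, total coverage.
    """
    sites = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        chrom = fields[0]
        if chrom.startswith("chr"):
            chrom = chrom[3:]  # assumption: tolerate an optional "chr" prefix
        if chrom in KEEP_CHROMS and int(fields[7]) >= min_cov:
            sites.add((chrom, int(fields[2])))  # columns 1 and 3 identify a site
    return sites


def shared_sites(per_sample_sites, min_fraction=0.8):
    """Keep sites that passed the filters in at least min_fraction of samples."""
    counts = Counter()
    for sites in per_sample_sites:
        counts.update(sites)
    threshold = min_fraction * len(per_sample_sites)
    return {site for site, n in counts.items() if n >= threshold}
```

Each sample's file would be passed to `covered_sites` as an iterable of text lines (for a gzipped file, a handle from `gzip.open(path, "rt")`), and the resulting sets collected into a list for `shared_sites`. With this logic I end up at ~4.4 million sites rather than ~5.5 million.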

Could you share any more detail on how the CGmap files were filtered, or how they should be filtered, to arrive at the final ~5.5 million sites?

Thanks. Mansi
