gchen98 / macs

Automatically exported from code.google.com/p/macs
16 stars 6 forks source link

Reduced variation in new MaCS version (as of 13th of Dec) #10

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run "old" and new version of MaCS with the shell scrip bellow to generate 
haplotype file
2. Run the R script bellow on haplotype file to compute site specific allele 
frequencies and plot them

What is the expected output? What do you see instead?
With previous version I got much more variation - allele frequencies covered 
whole spectrum [0-1], while now values are very uniform. In the attached plots 
I show this frequencies (y axis) versus site index (x axis). I also noticed 
that in the haplotype file there are now a lot of haplotypes with only zero 
alleles. In some cases (I can not reproduce this) I saw for the very first 
haplotype some other values than 0 and 1 in the first few sites.

What version of the product are you using? On what operating system?
Version as of 13th od Dec checked out from Google Code and compiled on Ubunut 
Linux 12.04. I do not know the version of old binary.

Please provide any additional information below.

#-------------------- START SHELL SCRIPT ---------------------------------------

#!/bin/bash 

# Set up number of haplotypes
nHaps=400

# Call MaCS
./macs $nHaps        \
       810000000     \
       -t 0.00000036 \
       -r 0.00000036 \
       -eN 0.05 2 -eN 0.1 4 -eN 0.15 6 -eN 0.2 8 -eN 0.25 10 -eN 0.3 12 -eN 0.35 14 -eN 0.4 16 -eN 0.45 18 -eN 0.5 20 -eN 1 40 -eN 2 60 -eN 3 80 -eN 4 100 -eN 5 120 -eN 10 140 -eN 20 160 -eN 30 180 -eN 40 200 -eN 50 240 -eN 100 320 -eN 200 400 -eN 300 480 -eN 400 560 -eN 500 640 1>output.txt 2>debug.txt 

# Format as in ms
cat output.txt | ./msformatter > haplotypes.txt 

# Tear appart the output
# ... header
tail -n +6 haplotypes.txt > Intermediate.txt
# ... haplotypes
tail -n +2 Intermediate.txt > MacsHaplotypes.txt
# ... map
head -n 1 Intermediate.txt | sed -e 's/positions: //' > PhysicalMapInput.txt
# ... number of segregating sites
grep segsites haplotypes.txt | sed  -e 's/segsites: //' > SegSites.txt
# ... get nice format with spaces
sed -e 's/0/ 0/g' -e 's/1/\ 1/g' MacsHaplotypes.txt > tmp
# ... save outgroup haplotypes in two files
head -n $nHaps tmp > MacsHaplotypes.txt

#-------------------- STOP  SHELL SCRIPT ---------------------------------------

#-------------------- START R SCRIPT -------------------------------------------

hap1 <- read.table(file="MacsHaplotypes.txt")
af1 <- colMeans(hap1)

png(file="AFplot.png", width=480*3, height=480*3)
plot(af1, pch=19, cex=0.5, col=rgb(178/255, 34/255, 34/255, alpha=1), ylim=c(0, 
1))
dev.off()

png(file="AFhist.png", width=480*3, height=480*3)
hist(af1, xlim=c(0, 1))
dev.off()

#-------------------- STOP  R SCRIPT -------------------------------------------

Original issue reported on code.google.com by gregor.g...@gmail.com on 3 Jan 2013 at 10:50

Attachments:

GoogleCodeExporter commented 9 years ago
You may have been using an old version of msformat.cpp that didn't accommodate 
the new mutation age field. I've checked in a new revision that removes the 
executable from the source tree.  Please check this one out.  Run "make all" 
and see if anything changes.

Original comment by gche...@gmail.com on 3 Jan 2013 at 11:11

GoogleCodeExporter commented 9 years ago
Tried with latest checkout and compiled, but the effect is still there. Let me 
know if I can supply you any more information.

Original comment by gregor.g...@gmail.com on 3 Jan 2013 at 11:52

GoogleCodeExporter commented 9 years ago
Please post a command line with a random seed -s specified and also post your 
stdout and stderr files.

Original comment by gche...@gmail.com on 4 Jan 2013 at 12:29

GoogleCodeExporter commented 9 years ago
Bellow is a modified script with seed option. Attached are also bzip2 archives 
for stdout and stderr files. 

#-------------------- START SHELL SCRIPT ---------------------------------------

#!/bin/bash 

# Set up number of haplotypes
nHaps=400

# Call MaCS
./macs $nHaps        \
       810000000     \
       -s 19791123   \
       -t 0.00000036 \
       -r 0.00000036 \
       -eN 0.05 2 -eN 0.1 4 -eN 0.15 6 -eN 0.2 8 -eN 0.25 10 -eN 0.3 12 -eN 0.35 14 -eN 0.4 16 -eN 0.45 18 -eN 0.5 20 -eN 1 40 -eN 2 60 -eN 3 80 -eN 4 100 -eN 5 120 -eN 10 140 -eN 20 160 -eN 30 180 -eN 40 200 -eN 50 240 -eN 100 320 -eN 200 400 -eN 300 480 -eN 400 560 -eN 500 640 1>stdout.txt 2>stderr.txt

# Format as in ms
cat output.txt | ./msformatter > haplotypes.txt

# Tear appart the output
# ... header
tail -n +6 haplotypes.txt > Intermediate.txt
# ... haplotypes
tail -n +2 Intermediate.txt > MacsHaplotypes.txt
# ... map
head -n 1 Intermediate.txt | sed -e 's/positions: //' > PhysicalMapInput.txt
# ... number of segregating sites
grep segsites haplotypes.txt | sed  -e 's/segsites: //' > SegSites.txt
# ... get nice format with spaces
sed -e 's/0/ 0/g' -e 's/1/\ 1/g' MacsHaplotypes.txt > tmp
# ... save outgroup haplotypes in two files
head -n $nHaps tmp > MacsHaplotypes.txt

#-------------------- STOP  SHELL SCRIPT ---------------------------------------

Original comment by gregor.g...@gmail.com on 4 Jan 2013 at 1:00

Attachments:

GoogleCodeExporter commented 9 years ago
Argh, I have changed in script the resulting stdout and stderr filenames in 
macs call to make it easier for you to spot. However, I forgot to change it 
later in the script. To make the above script work replace stdout.txt with 
output.txt.

Original comment by gregor.g...@gmail.com on 4 Jan 2013 at 1:02

GoogleCodeExporter commented 9 years ago
it's not clear to me from stdout which sites were monomorphic (no 1s).  can you 
give me example of site IDs?

Original comment by gche...@gmail.com on 4 Jan 2013 at 1:11

GoogleCodeExporter commented 9 years ago
I think I know what's going on.  stay tuned

Original comment by gche...@gmail.com on 4 Jan 2013 at 1:21

GoogleCodeExporter commented 9 years ago
Can you try my latest revision of msformat.cpp?  I found a bug that is most 
likely the culprit.

Original comment by gche...@gmail.com on 4 Jan 2013 at 1:28

GoogleCodeExporter commented 9 years ago
No, I was not saying that sites were monomorphic. I noticed that some 
haplotypes were "monomorphic" - several instances of haplotypes with just 0 or 
1 (at least very long stretches). I am attaching screenshots of haplotype.txt 
file (in ms format) where you can clearly see the differences, while the call 
was the same, just version of macs and msformatter are different.

If you check allele frequencies (=mean of allele codes for each site) from the 
new and old version you will see huge difference as well - shown in initial 
post. 

Original comment by gregor.g...@gmail.com on 4 Jan 2013 at 1:51

Attachments:

GoogleCodeExporter commented 9 years ago
look at the last diff under change log for msformat.cpp
http://code.google.com/p/macs/source/browse/trunk/msformat.cpp
and check out latest to see if this fixes it

Original comment by gche...@gmail.com on 4 Jan 2013 at 1:56

GoogleCodeExporter commented 9 years ago
Yes, now things look ok! Attached is a plot of allele frequencies (y axis) over 
sites (x axis).

Thanks!!!

Original comment by gregor.g...@gmail.com on 4 Jan 2013 at 2:03

Attachments:

GoogleCodeExporter commented 9 years ago
great thanks for pointing this bug out.

Original comment by gche...@gmail.com on 4 Jan 2013 at 2:04