biostars / biostar-handbook

Issue tracker for the Biostar Handbook

goa_human.gaf file headers counted as genes #80

Closed JWDebler closed 5 years ago

JWDebler commented 5 years ago

In the examples under "Understanding the GO data" we are asked to run these commands:

cat goa_human.gaf | cut -f 3 | head -20

# Gene names are in column 3. Find the unique ones.
cat goa_human.gaf | cut -f 3 | sort | uniq -c | head -20

# How many unique genes are present?
cat goa_human.gaf | cut -f 3 | sort | uniq -c | wc -l
# 19719 (this number changes when the data is updated)

However, these commands do not account for the 31 comment lines at the beginning of the file, so the header lines are counted as genes. As a result, the output of cat goa_human.gaf | cut -f 3 | sort | uniq -c | head -20 begins with:

11 ! 
4 !=================================
6 A0A075B6Q4
2 A0A087WT57
4 A0A087WUV0
1 A0A087WVE0
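
For anyone who wants to reproduce this without the full download, here is a minimal sketch on a tiny made-up file. The name sample.gaf and its rows (P1, GENEA and so on) are hypothetical stand-ins, not real annotation data. It counts the comment lines, then compares the tally of column 3 values with and without the comment lines filtered out:

```shell
# Tiny stand-in for goa_human.gaf; the rows are made up for illustration.
printf '!gaf-version: 2.1\n!\nUniProtKB\tP1\tGENEA\nUniProtKB\tP1\tGENEA\nUniProtKB\tP2\tGENEB\n' > sample.gaf

# Number of comment lines (those starting with "!").
comments=$(grep -c '^!' sample.gaf)

# Naive count: cut passes lines without a tab through unchanged,
# so each distinct comment line shows up as if it were a gene name.
naive=$(cut -f 3 sample.gaf | sort | uniq -c | wc -l)

# Filtered count: drop the comment lines before extracting column 3.
filtered=$(grep -v '^!' sample.gaf | cut -f 3 | sort -u | wc -l)

echo "comment lines: $comments"
echo "naive unique count: $naive"
echo "filtered unique count: $filtered"
```

The difference between the naive and filtered counts is the number of distinct comment lines, which is why the total reported by wc -l comes out too high on the unfiltered file.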

ialbert commented 5 years ago

Apologies for the error. It is a remnant from the 1st edition where we filtered the GAF file. The code has been corrected to contain:

# Uncompress the annotation file and remove lines starting with the
# comment character (!). This makes subsequent commands simpler.
gunzip -c goa_human.gaf.gz | grep -v '^!' > tmp

# Move the temporary file to the original name.
mv tmp goa_human.gaf

The correction has been applied to the content on the site and all editions of the book.
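
A quick way to check that the filtering step did its job is sketched below. It runs the same gunzip and grep commands on a miniature compressed file, so it can be tried without the real data. The file names mini.gaf.gz, mini.tmp and mini.gaf and the rows inside are assumptions made up for this illustration:

```shell
# Hypothetical miniature of goa_human.gaf.gz, for illustration only.
printf '!gaf-version: 2.1\nUniProtKB\tP1\tGENEA\nUniProtKB\tP2\tGENEB\n' | gzip -c > mini.gaf.gz

# Same filtering step as in the corrected code, applied to the sample.
gunzip -c mini.gaf.gz | grep -v '^!' > mini.tmp
mv mini.tmp mini.gaf

# Sanity check: count the lines that still start with "!".
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
leftover=$(grep -c '^!' mini.gaf || true)
echo "comment lines left: $leftover"

# With the comments gone, column 3 holds only gene names.
genes=$(cut -f 3 mini.gaf | sort -u | wc -l)
echo "unique genes: $genes"
```

On the real file the same two checks apply: the leftover count should be zero, and the unique gene total will vary as the data is updated.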