PeanutBase / PeanutbaseWebsite

Repo to document and track issues pertaining to PeanutBase website.

African breeding lines in gigwa: space in individual name #5

Open sdash-github opened 4 years ago

sdash-github commented 4 years ago

African breeding lines genotype data has been loaded into PB gigwa (https://peanutbase.org/data/public/Arachis_hypogaea/arahy.gnm1.div.LZ50/), but:

On 2019/12/13 10:02 AM, Jean Francois Rami wrote:

Dear all, Thank you for your quick reply!

We just looked into the file with Daniel and realized that there was probably an issue during the import into gigwa with individual names that contain a space. The hapmap file contains 1151 individuals, while the gigwa db contains only 913. Many names also differ: in gigwa they appear to be only part of a name that contains spaces in the original file.

It would be worth testing a reimport of the data into gigwa after replacing spaces with e.g. "_".

This case is actually a good incentive for using, in the future, unique sample IDs generated by a database like BMS, and retrieving germplasm IDs and attributes through the BrAPI directly from gigwa. This is one of the first use cases we want to work on with Mariano and Guilhem.

Best, JF

So the discrepancy between 1151 and 913 individuals needs to be sorted out in PB gigwa.

sdash-github commented 4 years ago

Will take the approach of substituting space with '_' and reload into gigwa.

Verified that substituting spaces with '_' would work:

data/public/Arachis_hypogaea/arahy.gnm1.div.LZ50]$ zcat arahy.gnm1.div.LZ50.snp_chip.hmp.gz | head -1 | perl -pe 's/\t/\n/g' | tail +12 | wc -l
    1151
Total 1151 headings for individuals.
$ zcat arahy.gnm1.div.LZ50.snp_chip.hmp.gz | head -1 | perl -pe 's/\t/\n/g' | tail +12 | grep '[[:space:]]' | wc -l
     544  # individuals with space in the name
$ zcat arahy.gnm1.div.LZ50.snp_chip.hmp.gz | head -1 | perl -pe 's/\t/\n/g' | tail +12 | grep -v '[[:space:]]' | wc -l
     607  # without space in name
$ zcat arahy.gnm1.div.LZ50.snp_chip.hmp.gz | head -1 | perl -pe 's/\t/\n/g' | tail +12 | tr ' ' '_'  | grep '[[:space:]]' | wc -l
       0  
$ zcat arahy.gnm1.div.LZ50.snp_chip.hmp.gz | head -1 | perl -pe 's/\t/\n/g' | tail +12 | tr ' ' '_'  | grep -v '[[:space:]]' | wc -l
    1151  # space to '_' works
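
The checks above could be folded into one small helper, sketched here under assumptions: the file follows the usual HapMap layout of 11 fixed columns followed by one column per individual, and the function names `hmp_sample_names` and `hmp_space_report` are made up for illustration.

```shell
# Print one individual name per line from the header of a gzipped HapMap file.
# Assumes 11 fixed columns precede the individual columns.
hmp_sample_names() {
  gzip -dc "$1" | head -n 1 | tr '\t' '\n' | tail -n +12
}

# Summarize how many individual names contain a space.
hmp_space_report() {
  hmp_sample_names "$1" | awk '
    / / { with_space++ }
    END {
      printf "total=%d with_space=%d without_space=%d\n",
             NR, with_space, NR - with_space
    }'
}
```

Running `hmp_space_report arahy.gnm1.div.LZ50.snp_chip.hmp.gz` before and after the substitution gives the three counts in a single line.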

sdash-github commented 4 years ago

Space substituted with '_'

Backed up then:

$ zcat arahy.gnm1.div.LZ50.snp_chip.hmp.gz | sed '1s/ /_/g' | gzip > arahy.gnm1.div.LZ50.snp_chip.SpaceRemoved.hmp.gz

Original file removed and modified file renamed to original.
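
For future reloads, the rewrite-and-rename step could be wrapped so that the original is only replaced once the new file passes basic checks. This is a sketch, not the procedure actually used here: the function name `fix_hmp_header` and the `.orig` backup suffix are arbitrary choices for illustration.

```shell
# Replace spaces with '_' in the header line only, verify the result,
# then keep the original as <file>.orig and move the fixed file into place.
fix_hmp_header() {
  local src=$1
  local tmp="$1.tmp.$$"
  gzip -dc "$src" | sed '1s/ /_/g' | gzip > "$tmp" || return 1
  # The new archive must be intact and have the same number of lines.
  gzip -t "$tmp" || return 1
  [ "$(gzip -dc "$src" | wc -l)" -eq "$(gzip -dc "$tmp" | wc -l)" ] || return 1
  mv "$src" "$src.orig" && mv "$tmp" "$src"
}
```

Keeping the `.orig` copy until the gigwa load has been confirmed makes it possible to roll back if the substitution turns out to be unwanted.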

Verified

$ zcat arahy.gnm1.div.LZ50.snp_chip.hmp.gz | head -1 | perl -pe 's/\t/\n/g' | tail +12  |  grep '[[:space:]]' | wc -l
       0  # no header name with space
$ zcat arahy.gnm1.div.LZ50.snp_chip.hmp.gz | head -1 | perl -pe 's/\t/\n/g' | tail +12  |  grep -v '[[:space:]]' | wc -l
    1151 

Now file ready for loading into gigwa

sdash-github commented 4 years ago

Loaded in PB-stage with Ethy after deleting the previous one. Now 1145 individuals loaded into gigwa db out of 1151 in file (earlier it was 913). Instructions in PB project notes doc from Ethy shared with us.
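
The cause of the remaining gap (1145 loaded vs. 1151 in the file) is not established in this thread. One thing that could be checked, purely as an assumption, is whether any individual names occur more than once in the header, since repeated names might be merged on import. A minimal sketch, with the hypothetical function name `hmp_duplicate_names` and the usual 11 fixed HapMap columns assumed:

```shell
# Print each individual name that appears more than once in the header
# of a gzipped HapMap file (one line per duplicated name).
hmp_duplicate_names() {
  gzip -dc "$1" | head -n 1 | tr '\t' '\n' | tail -n +12 | sort | uniq -d
}
```

An empty result from `hmp_duplicate_names arahy.gnm1.div.LZ50.snp_chip.hmp.gz` would rule this explanation out and point to something else in the import.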

TO DO: Email to them when data in production after rollover.

sdash-github commented 4 years ago

Updated in DS. TO DO: Need to convert .hmp to flapjack format.

sdash-github commented 4 years ago

@adf-ncgr: Hi Andrew, from my quick reading it sounds like the flapjack format is different from the VCF format? Do I need flapjack installed to convert our .hmp file to flapjack format?

adf-ncgr commented 4 years ago

flapjack can import a few formats, but I'm not sure hmp is one of them. I have a vcf converter that could probably be tweaked for this purpose, though. You probably should get flapjack installed on your laptop, though, as it will be needed to produce the flapjack format (which is really a sqlite3 db file, representing a sort of session file for flapjack). let's discuss some more when we meet tomorrow, as I think there may be some minor problems with the files (my laptop is really struggling to load the flapjack file that's in there; also, the hmp file does not seem to actually refer to arahy.gnm1 as the folder would imply)