PeanutBase / jekyll-peanutbase

A "starter" Jekyll site that uses the jekyll-theme-legumeinfo submodule
Apache License 2.0

Gigwa: Migration to NCGR and peanutbase #17

Open sdash-github opened 1 year ago

sdash-github commented 1 year ago

The Gigwa container needs to be migrated so that it is housed within the NCGR-LIS infrastructure and linked to from the PB-Jekyll site.

adf-ncgr commented 1 year ago

Just set up a test gigwa2 instance at http://dev.lis.ncgr.org:50053/gigwa in case you want to start populating a database with an updated version of the dataset that needed the naming changes. I changed the admin password from the default and will send it to you.

sdash-github commented 1 year ago

Thanks.

  1. I may not have all the past datasets to load.
  2. I have the latest, which seems to be all-inclusive in a single dataset.

2.1 It will need some work to substitute the >4000 CEL file names with the corresponding genetic line names after removing the space characters from the genetic line strings. I will do this.



adf-ncgr commented 1 year ago
  1. I imagine the old datasets are in the datastore in some form, but worst case I think we can export them from the current production instance of gigwa and reload them (or work with Nathan to get some sort of database dump).

2.1 I probably already have a script that could handle the renaming if you want me to take a look at it.

sdash-github commented 1 year ago

Attempting the minicore dataset as a pilot. Verifying that the datastore and export versions have the same number of rows in the VCF file.

Datastore, v2/Arachis/hypogaea/diversity/aradu1_araip1.gnm1.div.Otyama_Wilkey_2019: arahy.aradu1_araip1.gnm1.div.Otyama_Wilkey_2019.snp_chip.vcf.gz has 15897 rows (grep -v "^#").

Exported from the current PeanutBase Gigwa: sdmilam/projects/PB-NCGR/GigwaReloading202209/US_Peanut_MiniCore15897variants104individuals.vcf has 15897 rows.

The counts match, so we should use the datastore (DS) version of the file for standardization and consistency.
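For reference, the row-count check above can be sketched in Python as an equivalent of counting the output of grep -v "^#". This is only a minimal sketch: the function names are made up for illustration, and it assumes the file is either plain text or gzip-compressed (chosen by the .gz suffix).

```python
import gzip


def count_variant_rows(path):
    """Count lines that do not start with '#', like `grep -v "^#" | wc -l`."""
    # Assumption: a ".gz" suffix means gzip/bgzip compression, otherwise plain text.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if not line.startswith("#"))


def same_row_count(path_a, path_b):
    """Return True when both VCF files have the same number of non-header rows."""
    return count_variant_rows(path_a) == count_variant_rows(path_b)
```

Calling same_row_count() with the datastore file and the exported file would reproduce the comparison; it only compares row counts, not the content of the rows.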

sdash-github commented 1 year ago

Upload went well. Needs some more meta info like chip name, etc. for completeness.

sdash-github commented 1 year ago

Unable to edit the meta-information after loading (e.g. adding more text to the description, the chip name, etc.). I could not find a way to do this in either the documentation or their publication. So we should delete the database and reload it after collecting all the info.

sdash-github commented 1 year ago

Now all three datasets are loaded: Core, Minicore and the all-encompassing African lines dataset AG_4057_14471. Working dir: (SDash) sdmilam/projects/PB-NCGR/GigwaReloading202209

sdash-github commented 1 year ago

Forgot the Clevenger_Korani_2018.snp_chip dataset of African lines. Now added with description.

sdash-github commented 1 year ago

Peggy pointed out that she sees numbers instead of genotype names. The spreadsheet they sent has numbers in place of genotype names in many rows, e.g. rows 1273-1296:

a550846-4390129-041321-033_B01.CEL 139915
a550846-4390129-041321-033_B02.CEL 274253
... ...
a550846-4390129-041321-033_B23.CEL 185633
a550846-4390129-041321-033_B24.CEL 270907

I think Gigwa displays the individuals after sorting, so the numbers (entered as genotype names in the spreadsheet) appear at the top, and there are a lot of them.
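To illustrate the guess above: under a plain lexicographic sort, names that start with a digit come before names that start with a letter, so purely numeric entries would cluster at the top of a sorted list. The sketch below assumes that kind of ordering (how Gigwa actually sorts is not confirmed here), and the helper name is hypothetical. It also lists the numeric-only names so they can be reviewed.

```python
def numeric_only_names(names):
    """Return the names that consist solely of digits (e.g. bare PI numbers)."""
    return [name for name in names if name.strip().isdigit()]


# Example values taken from this thread.
names = [
    "Ug-183_Oug-ICGV-SM-02724",
    "139915",
    "Ug-78_Oug-S.4-X-99044-RED-UG",
    "274253",
]

# Assumption: a plain string sort, which places digit-leading names first.
ordered = sorted(names)
flagged = numeric_only_names(names)
print(ordered)
print(flagged)
```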

Requested Peggy to send an updated spreadsheet.

sdash-github commented 1 year ago

On 2022/10/7 7:57 AM, Peggy Ozias-Akins wrote:

Hi Sudhansu,

Before we output another file, I want to be sure I understand the issue. The numbers in your email below are PI numbers, although they should be preceded by PI. We can fix that. However, it looks like the script only imported genotype IDs that had numbers and not text. For example, rows 100-101 (and many others) in the spreadsheet show actual sample IDs that correspond to CEL file names. I don’t see

a550846-4381366-061020-491_D06.CEL Ug-183_Oug-ICGV SM 02724
a550846-4381366-061020-491_D07.CEL Ug-78_Oug-S.4 X 99044 RED UG

Regards, Peggy

My response after checking:

Hi Peggy,

I looked for Ug-183_Oug-ICGV-SM-02724 and Ug-78_Oug-S.4-X-99044-RED-UG in the individuals dropdown lookup and found them. Please note that all the spaces have been converted to a '-' char because Gigwa shows erroneous behaviour when loading IDs that contain spaces. So, please look them up with hyphens in place of spaces.

Please also note that the spreadsheet has 4062 rows but the VCF file has 4057 '.CEL' columns, which have been replaced with the genotype names. So five genotypes in the spreadsheet won't be found in this Gigwa dataset; I don't know which five.

Please let me know if there are other issues and I will address them. And thanks for looking at the Gigwa data thoroughly.

Sudhansu
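One way to find out which five spreadsheet entries have no column in the VCF is to compare the CEL names in the spreadsheet against the sample columns of the original VCF header (before renaming). The sketch below is only illustrative. It assumes the spreadsheet has been exported as tab-separated text with the CEL file name in the first column and the genotype name in the second, and the function names are hypothetical.

```python
import csv


def vcf_sample_names(vcf_lines):
    """Return the sample column names from the '#CHROM' header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # Standard VCF: 8 fixed columns plus FORMAT, then one column per sample.
            return line.rstrip("\n").split("\t")[9:]
    return []


def read_mapping(tsv_lines):
    """Read CEL-name -> genotype-name pairs (assumed two-column, tab-separated)."""
    mapping = {}
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def missing_from_vcf(mapping, samples):
    """List spreadsheet CEL names that have no matching sample column in the VCF."""
    present = set(samples)
    return sorted(cel for cel in mapping if cel not in present)
```

Both readers accept any iterable of lines, so they work with open file handles as well as lists of strings.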

sdash-github commented 1 year ago

New spreadsheet is now available. TO DO:
Generate file without spaces in names, generate VCF with new names, delete current dataset and reload.
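The renaming step in this to-do could look roughly like the sketch below. It is not the script actually used. It assumes a dict mapping CEL file names to genotype names is already available, replaces each space in a genotype name with '-', and rewrites only the '#CHROM' header line of the VCF. The function names are hypothetical.

```python
def sanitize(name):
    """Replace spaces with hyphens, since IDs containing spaces cause loading problems."""
    return name.strip().replace(" ", "-")


def rename_header(line, mapping):
    """Rename the sample columns of a '#CHROM' header line using the mapping."""
    fields = line.rstrip("\n").split("\t")
    fixed, samples = fields[:9], fields[9:]
    # Samples without a mapping entry keep their original (sanitized) name.
    renamed = [sanitize(mapping.get(sample, sample)) for sample in samples]
    if len(set(renamed)) != len(renamed):
        raise ValueError("renaming would create duplicate sample names")
    return "\t".join(fixed + renamed) + "\n"


def rename_vcf_lines(lines, mapping):
    """Yield VCF lines unchanged, except for the renamed '#CHROM' header line."""
    for line in lines:
        yield rename_header(line, mapping) if line.startswith("#CHROM") else line
```

The duplicate check is there because two different spreadsheet names could collapse to the same string once spaces are replaced.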

sdash-github commented 1 year ago

Gigwa, comparing haplotypes of two individuals: sometimes the bottom panel doesn't update. Will try showing this to adf to see if there is a trick.

sdash-github commented 1 year ago

AG_4057_14471 dataset reloaded after generating the necessary reformatted files. Work dir: sdmilam/projects/PB-NCGR/GigwaReloading202209. Invited Peggy to look at it. Will close the issue after she responds.