Biometris / statgenGWAS

See https://biometris.github.io/statgenGWAS for a full description
https://biometris.github.io/statgenGWAS/
12 stars 3 forks source link

Problem with input files #7

Closed kostasgalexiou closed 1 year ago

kostasgalexiou commented 2 years ago

Hi,

When I try to generate the gData object, I get the following error:

> gdata <- createGData(geno = genotypes_transp, map = marker_map, pheno = pheno) Warning messages: 1: map contains no marker names. Names constructed from chromosome and position. 2: geno contains no genotype names. Names taken from pheno. 3: Not all markers in geno are in map. Extra markers will be removed.

In spite of the warning I carry on with the vignette to clean the data and I get the following error:

> gdata_clean = codeMarkers(gData = gdata, verbose = TRUE) Error in codeCharMarkers(markersClean, refAlls, refAll[1] == "minor", : Column index is out of bounds: [index=0; column extent=0].

I would appreciate any help. Below are the input files that I am using:

phenotyping file: pheno_present_in_genotyping.txt genotyping file: transposed_geno.txt marker map file: geno_map.txt

I would appreciate any help.

Gracias, Kostas

BartJanvanRossum commented 2 years ago

Thanks for sharing your problem. Since you didn't share the code you used I had to guess a little to get at the error message you got. I think it was something along these lines.

library(statgenGWAS)

marker_map <- read.table("geno_map.txt", header = TRUE, sep = "\t")
genotypes_transp <- read.table("transposed_geno.txt", header = TRUE, sep = "\t")
pheno <- read.table("pheno_present_in_genotyping.txt", header = TRUE)

gdata <- createGData(geno = genotypes_transp, map = marker_map, pheno = pheno)

This gives the warning messages you display above. These messages are there for a reason and in this case reading the function documentation properly and making some small modifications to your data fixes the problem.

As explained in the documentation, the map should have rownames that correspond to the markers. The same holds for the genotypic file.
So adding this to you code fixes the problem:

rownames(marker_map) <- marker_map$marker.names
rownames(genotypes_transp) <- genotypes_transp$Indv
genotypes_transp <- genotypes_transp[, -1]

gdata <- createGData(geno = genotypes_transp, map = marker_map, pheno = pheno)

In your original example you ended up with no genotypes after the createGData step, simply because the function didn't know how to match the map, geno and pheno files. I agree that the error message that is then produced by codeMarkers is extremely unclear. I will have a look at improving that in the next release.

kostasgalexiou commented 2 years ago

Hi @BartJanvanRossum,

Thanks for your quick reply! After adding the commands that you suggested, things are working properly. I got confused because I was trying to replicate the format of dropsMap, dropsMarkers and dropsPheno objects.

codeMarkers() is working properly as well.

Many thanks

kostasgalexiou commented 2 years ago

i have one more question, resulting from generating the data object. When i run summary(gData) I get the following general information

map

Number of markers: 32660 
Number of chromosomes: 28 

markers

Number of markers: 32660 
Number of genotypes: 317 
Content:
 0 1 2 NA 
 0.5 0.36 0.14 0 

pheno

Number of trials: 1 

pheno:
    Number of traits: 317 
    Number of genotypes: 317

So the object sees that I have 317 genotypes with 317 different traits, whereas I have 317 genotypes with 1 trait. I suppose there a formatting problem with the pheno file?

BartJanvanRossum commented 2 years ago

Thanks for noticing. This actually is a bug in the package. It happened because you only have one trait in your data. Surprisingly this has gone unnoticed for a long time.
I fixed the bug in the development version of the package. You can install it from github by running this line in R:


remotes::install_github("Biometris/statgenGWAS", ref = "develop", dependencies = TRUE)
kostasgalexiou commented 2 years ago

Thanks!

MARCPUCV commented 2 years ago

Good afternoon everyone, I have a question related to generate the Gdata class object to use it later for the coding and imputation of missing data with the codeMarkers () function using Beagle imputation method. There will be an example since I don't know what the format of the files should be that should be generated in Beagle and then generate the gData with that information. Beforehand thank you very much. Kind regards!

BartJanvanRossum commented 2 years ago

I am not sure I understand your question. Creating a gData object for imputation with beagle is exactly the same as creating any other gData object. When you call the codeMarkers function and tell it to use Beagle for imputation, the function will take care of converting the format to a format that Beagle can read, then do the imputation and convert the output from Beagle back to the gData format. So there is no need to do anything different when you create your gData object.
I hope this clarifies things a bit. If not, please let me know

MARCPUCV commented 2 years ago

Hi @BartJanvanRossum, I get this error and I don't know why? Could it be that the Gdata object is formed with files that are not in Beagle format? or what is this error due to? because the Gdata object has the information of the genotypes and the genetic map and those files are .txt they are not in Beagle format. Thank you very much in advance, best regards !! Error in read.table(gzfile(paste0(tmpVcfOut, ".vcf.gz")), stringsAsFactors = FALSE) : no lines available in input Además: Warning message: In system(paste0("java -Xmx3000m -jar ", shQuote(paste0(sort(path.package()[grep("statgenGWAS", : comando ejecutado 'java -Xmx3000m -jar "C:/Users/angel/Documents/R/win-library/4.1/statgenGWAS/java/beagle.jar" gt=C:\Users\angel\AppData\Local\Temp\RtmpuMTqcO\file6a5c707611c6input.vcf out=C:\Users\angel\AppData\Local\Temp\RtmpuMTqcO\file6a5c2a363ff6out gp=true seed=1234 nthreads=1 map=C:\Users\angel\AppData\Local\Temp\RtmpuMTqcO\file6a5c3bef6acd.map' tiene estatus 1

BartJanvanRossum commented 2 years ago

Hi @MARCPUCV , I don't think the way the gData object is formed is the problem. In principle, if you can do imputation with one of the other methods, you should also be able to do imputation using beagle. The codeMarkers function takes care of all the formatting that needs to be done to get you files ready for beagle. It is very hard to say what is really going on here without having a look at your actual data and script. Would it be possible to share this with me? A small subset would probably do.

MARCPUCV commented 2 years ago

Hi @BartJanvanRossum, I send you a small subset of the data sets to the mail. Thank you so much for all your help!! Confirm received, please. Kind regards!! My email account is maria.rueda.c@pucv.cl

BartJanvanRossum commented 2 years ago

I received your files. I will have a look at it

BartJanvanRossum commented 2 years ago

I had a look at the files. Your gData object is perfectly fine, no problems there. If I try other types of imputation it is also working.
The problem is in the fact that beagle requires the position to be in centiMorgan. Judging from your data it seems your positions are in basepairs.
I just checked the function documentation and the vignette and this isn't mentioned anywhere. I will make sure this gets done in the next version of the package. Sorry for the inconvenience.

MARCPUCV commented 2 years ago

Hi @BartJanvanRossum, Thank you very much for all your help, so I must place the information in cM on the genetic map and generate the Gdata with that information to make the imputation with Beagle. I'll tell you how that goes when I do it. Kind regards!!!

MARCPUCV commented 2 years ago

Hi @BartJanvanRossum, I wanted to tell you to put the information in cM but that information is in decimals when I use the codeMarkers function it gives me the following error and I don't know why? Error in codeMarkers(gData = gData_VAC, impute = TRUE, imputeType = "beagle", : When using beagle imputation gData should contain a map with only integer positions.

image

BartJanvanRossum commented 2 years ago

Hi @MARCPUCV, It is more or less exactly what the message says. The beagle software only accepts positions that are in an integer format. Looking at your screenshot multiplying by 100 and then rounding might do the trick for you.

MARCPUCV commented 2 years ago

Hi @BartJanvanRossum, I tried multiplying and rounding round((100 * G_MAP_VAC$POS_cM), 0), but when the imputation is done almost all the columns are eliminated which is something that should not happen. What would be happening?

image

MARCPUCV commented 2 years ago

Hi @BartJanvanRossum, Could you help me with a screenshot of a genetic map that can be used with the Beagle imputation for reference please. Best regards!!

BartJanvanRossum commented 2 years ago

After some digging done (see #8) it seems that something (not yet sure what) changed in the input format for beagle 5.2 compared to the previous version used in the package (beagle 4.1). For now I fixed this by putting back beagle 4.1. It also seems to solve your problem with the imputation.
You can use your original input, no need to divide by 1000.

You can get the latest version by installing the package from the develop branch:

remotes::install_github("Biometris/statgenGWAS", ref = "develop", dependencies = TRUE)

I will still try to figure out why beagle 5.2 doesn't work, but at the very least you can do the imputation now.

MARCPUCV commented 2 years ago

Hi @BartJanvanRossum, I tell you that I did what you told me in the previous message and the imputation with Beagle worked perfectly considering the bp position on the genetic map. Thank you very much and I managed to make the imputation with Beagle with your indications from the last message. Very grateful for your help. Kind regards. Angélica

BartJanvanRossum commented 1 year ago

Closing this as the original issue is solved and the beagle issue tracked in #8