EricArcher / strataG

strataG is a toolkit for haploid sequence and multilocus genetic data summaries, and analyses of population structure.
25 stars 12 forks source link

Misspelt word in the code for read.arlequin function #25

Closed harbi811 closed 4 years ago

harbi811 commented 4 years ago

The word FREQUENCY is misspelt as FREQUNENCY and for that reason, the function is unable to recognise Arlequin files whose DataType is FREQUENCY

read.arlequin function (file) { arp <- scan(file, what = "character", sep = "\n", quiet = TRUE) getValues <- function(title, arp) { x <- grep(title, arp, ignore.case = TRUE, value = TRUE) gsub("^[[:alnum:]]+=", "", x) } title <- gsub("[[:punct:]]+", "", getValues("Title", arp)) data.type <- gsub("[[:punct:]]+", "", getValues("DataType", arp)) if (!toupper(data.type) %in% c("FREQUNENCY", "DNA", "MICROSAT")) { stop("DataType must be of FREQUENCY, DNA, or MICROSAT") } missing.data <- gsub("[']|\"", "", getValues("MissingData", arp)) locus.separator <- gsub("[']|\"", "", getValues("LocusSeparator", arp)) sample.name <- gsub("[']|\"", "", getValues("SampleName", arp)) data.start <- grep("SampleData", arp, ignore.case = TRUE) + 1 data.end <- grep("[}]", arp)[1:length(data.start)] - 1 split.char <- switch(locus.separator, WHITESPACE = "[[:space:]]", TAB = "[[:space:]]", NONE = "", locus.separator) gen.data <- do.call(rbind, lapply(1:length(data.start), function(i) { lines <- strsplit(arp[data.start[i]:data.end[i]], split = split.char) lines <- lapply(lines, function(x) { x <- x[x != ""] x[x == missing.data] <- NA x }) max.len <- max(sapply(lines, length)) mat <- do.call(rbind, lapply(lines, function(x) { if (length(x) < max.len) c(rep(NA, max.len - length(x)), x) else x })) for (r in 2:nrow(mat)) { if (is.na(mat[r, 1])) mat[r, 1] <- mat[r - 1, 1] } cbind(rep(sample.name[i], nrow(mat)), mat) })) haploid.mat <- function(x) { new.mat <- do.call(rbind, lapply(1:nrow(x), function(i) { mat <- x[rep(i, as.numeric(x[i, 3])), , drop = FALSE] if (nrow(mat) > 1) mat[, 2] <- paste(mat[, 2], 1:nrow(mat), sep = "") mat[, 2] <- paste(mat[, 2], x[i, 1], sep = "") cbind(mat[, c(2, 1), drop = FALSE], rep(x[i, 2], nrow(mat))) })) colnames(new.mat) <- c("id", "strata", "gene") new.mat } switch(toupper(data.type), DNA = { new.mat <- haploid.mat(gen.data) seq.mat <- gen.data[, c(2, 4), drop = FALSE] seq.mat <- seq.mat[!duplicated(seq.mat[, 1]), , drop = FALSE] seq.mat <- seq.mat[order(seq.mat[, 1]), ] dna <- lapply(strsplit(seq.mat[, 2], ""), function(x) { x[x == missing.data] <- "n" tolower(x) }) names(dna) <- seq.mat[, 1] dna <- as.DNAbin(dna) df2gtypes(new.mat, ploidy = 1, sequences = dna, description = title) }, FREQUENCY = { new.mat <- haploid.mat(gen.data) df2gtypes(new.mat, ploidy = 1, description = title) }, MICROSAT = { ploidy <- unname(table(gen.data[, 2])[1]) new.mat <- do.call(rbind, split(gen.data[, -(1:3)], gen.data[, 2])) freqs <- na.omit(gen.data[, 1:3]) new.mat <- cbind(freqs, new.mat[freqs[, 2], ]) new.mat <- do.call(rbind, lapply(1:nrow(new.mat), function(i) { x <- new.mat[rep(i, as.numeric(new.mat[i, 3])), , drop = FALSE] if (nrow(x) > 1) x[, 2] <- paste(x[, 2], 1:nrow(x), sep = "_") rownames(x) <- x[, 2] x })) new.mat <- new.mat[, c(2, 1, 4:ncol(new.mat))] loc.names <- paste("Locus", 1:(ncol(gen.data) - 3), sep = "") loc.names <- paste(rep(loc.names, each = ploidy), 1:ploidy, sep = ".") colnames(new.mat) <- c("id", "strata", loc.names) df2gtypes(new.mat, ploidy = ploidy, description = title) }) } <bytecode: 0x7fd2f54b90e8>

EricArcher commented 4 years ago

Thanks for catching this. I've fixed it in the big redevelopment I'm working on and hope to release in early February. It is largely stable now and can be installed with:

devtools::install_github("ericarcher/stratag", ref = "redevel2018")

harbi811 commented 4 years ago

Thank you @EricArcher for the clarification. I have a humble request. May you please provide the test Arlequin files that you used for testing the function. I'm still unable to use the function. I would love to compare the files with my files.

EricArcher commented 4 years ago

I'll try to include them in the test data of the new release. I'm hopeful to have that ready in a couple of weeks.

EricArcher commented 4 years ago

I've entirely rewritten the arlequin parsing function (now 'arlequinRead()') and tested it with all of the example .arp files that ship with arlequin. @harbi811, would you mind installing the 'redevel2018' version of strataG and trying out 'arlequinRead()' and 'arp2gtypes()' on your files?

harbi811 commented 4 years ago

Thank you very much @EricArcher. I am able to parse my files. 😀😀