MoBiodiv / mobr

Tools for analyzing changes in diversity across scales

get_delta_stats vector error on large data set #204

Closed elslabbert closed 6 years ago

elslabbert commented 6 years ago

Hi there, I am trying to run several very large microbial datasets using the MOB package (the data have 370 observations and > 14,500 variables). So far everything has run smoothly, but then I get what looks like a computational error at the get_delta_stats function.

Script: site1_mob_in = make_mob_in(site1_dat, site1_coords, coord_names = c('lon', 'lat'), latlong = T)

site1_delta_stats = get_delta_stats(site1_mob_in, 'group', ref_group='land_useA',

Any suggestions as to how to overcome this error?

Thanks, elslabbert

dmcglinn commented 6 years ago

Hey @elslabbert, have you tried specifying the arguments log_scale = TRUE or inds? These arguments help ensure that the individual-based rarefaction curves are not computed at every possible integer. When you have a very large N, as is typical in microbial communities, these arguments become more important. I would first try log_scale = TRUE. If that still doesn't work, then specify inds; you'll have to decide either 1) how many points you want on the individual rarefaction curve or 2) the actual numbers of individuals you want to compute results for, as inds can be used to specify either of these. Let us know how it goes. So far we don't have many folks using the package on microbes, so I'm eager to see if it's possible.

dmcglinn commented 6 years ago

I noticed that, as currently implemented, the argument log_scale does not spread effort across a log scale unless the argument inds is set to an integer indicating the number of points at which to compute the curves. I'm going to fix this so that inds can be left as NULL and, if log_scale is set to TRUE, the efforts are spread across a log base 2 range.
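To illustrate why the spacing of sampling efforts matters, here is a minimal base R sketch comparing a curve evaluated at every integer with one evaluated on a log base 2 grid. It does not call mobr; the total N and all variable names are hypothetical, and the exact grid the package builds may differ.

```r
# Hypothetical total number of individuals (e.g. sequence reads); an assumption
N <- 1e6

# One effort per integer: the curve would be evaluated N times
effort_all <- seq_len(N)

# Efforts spread across a log base 2 range: 1, 2, 4, ..., with N appended
# so the curve still reaches the full sample
effort_log2 <- unique(c(2^(0:floor(log2(N))), N))

# A fixed number of points (here 100) spaced evenly on the log2 scale;
# rounding creates duplicates at small efforts, so unique() drops them
effort_fixed <- unique(round(2^seq(0, log2(N), length.out = 100)))

# Number of points each scheme would evaluate
c(all = length(effort_all),
  log2 = length(effort_log2),
  fixed = length(effort_fixed))
```

With these assumed numbers the log base 2 grid has 21 points instead of one million, which is the kind of reduction the two arguments are meant to provide.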

elslabbert commented 6 years ago

Hi dmcglinn, I initially had log_scale = TRUE, which didn't work (perhaps because of what you mention in your last post). I ran it again, in case you had managed to fix the code as described above, but it still gave me the same error. I then ran it with inds = 100. That seems to be running, although it is still taking some time (it has been running for > 60 min). I guess I should try running it with a higher integer value?

elslabbert commented 6 years ago

Hey @dmcglinn, I have moved on to running my code on an even larger data set (> 62,000 OTUs) and have set the arguments as discussed above: log_scale = TRUE, and inds first as 1000, then 10,000. But I get the error again about the function not being able to allocate such a large vector (can't allocate a vector of 912.7 Mb). Here is the complete code:

```r
for(i in 1:length(dat.list)){
  # create mob input
  mob_in.list[[i]] <- make_mob_in(comm = dat.list[[i]], plot_attr = my_coords[[i]],
                                  coord_names = c('lon', 'lat'), latlong = T)

  # two-scale analyses
  # reduced permutations from 199 (set initially for the plants & fungi) to 99
  # because of the very large bacteria dataset; re-ran the plants and fungi to
  # standardize the analyses across taxa
  mob_result.list[[i]] <- get_mob_stats(mob_in.list[[i]], group_var = "group", n_perm = 99)

  # continuous scale
  # for fungi and bacteria data make inds = 1000 / 10 000 respectively, because
  # of the size of the dataset; not necessary for the plant dataset
  delta_result.list[[i]] <- get_delta_stats(my_mob_in, 'group', ref_group = 'Pasture',
                                            type = 'discrete', log_scale = TRUE,
                                            n_perm = 99)  # n_perm 99 for all three taxa
}
```
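For a sense of scale, the size in the allocation error can be converted to a number of elements. This is only a rough sketch: it assumes the vector holds 8-byte doubles and that R reports Mb as 1024^2 bytes. Neither the vector's type nor where it is created inside get_delta_stats is stated in this thread, and the variable names are hypothetical.

```r
size_mb <- 912.7          # size quoted in the error message
bytes_per_double <- 8     # assumption: a numeric (double) vector

# Number of elements a vector of that size would hold
n_elements <- size_mb * 1024^2 / bytes_per_double
n_elements

# For comparison: a dense numeric matrix with the dimensions of the first
# dataset mentioned in the thread (370 observations x roughly 14500 variables),
# expressed in the same units
comm_mb <- 370 * 14500 * bytes_per_double / 1024^2
comm_mb
```

Under these assumptions the failed allocation is a single vector of about 120 million values, far larger than the community matrix itself, which is consistent with intermediate results growing with the number of sampling efforts.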

elslabbert commented 6 years ago

@dmcglinn ...so I tried it again and let it run over the weekend with the same settings, and it ran to completion. But I also got 50 warning messages for the data set saying: In anova.lm(mod) : ANOVA F-tests on an essentially perfect fit are unreliable.

How would you recommend I address this to make the results more reliable?

dmcglinn commented 6 years ago

Hey @elslabbert, thanks for these updates on this issue. I'm confused about why get_delta_stats would generate the anova warnings, because that function never calls anova. The function get_mob_stats does call anova when computing the F-value, which it uses as a test statistic for permutation tests. Is it possible that this is a warning from running get_mob_stats instead of get_delta_stats? Also, that warning indicates that you have zero or maybe one degree of freedom (it's like a t-test when n = 2).
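The warning can be reproduced in base R with a toy model, which may help in recognising the situation. This is a minimal sketch with made-up data (two groups, one observation each), not the model that get_mob_stats fits internally.

```r
# Toy data: two groups with a single observation each (hypothetical values)
y <- c(1, 2)
g <- factor(c("A", "B"))

mod <- lm(y ~ g)

# No residual degrees of freedom: the model reproduces the data exactly
df.residual(mod)

# Collect the warnings raised by anova() instead of printing them
warns <- character(0)
res <- withCallingHandlers(
  anova(mod),
  warning = function(w) {
    warns <<- c(warns, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
warns
```

Because the fitted values match the observations exactly, the residual sum of squares is essentially zero and anova() flags the F-test as unreliable.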

elslabbert commented 6 years ago

Hi @dmcglinn, thanks for the prompt follow-up. The loop I have set up runs through both of these functions in sequence, so the warning messages are then from get_mob_stats, not get_delta_stats. I am running one of the datasets separately through these two steps to double-check.

Regarding the last sentence in your comment above: despite an attempt at a balanced sampling design, the data I'm using are not 100% equal in the number of replicates/sampling effort per treatment across sites. Some sites have fewer replicates per treatment group than others (e.g. site 1 has equal sampling effort (30:30), but at site 2 there are only 10 replicates of treatment A and 30 of treatment B). Could this be causing the issue?
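One way to check whether a site is short on replicates is to tabulate the plots per treatment and count the residual degrees of freedom. The sketch below uses the site 2 numbers from this comment and assumes a simple one-way layout; the group labels and variable names are hypothetical, and it does not by itself settle where the warnings come from.

```r
# Hypothetical grouping vector for site 2: 10 plots of treatment A, 30 of B
group <- factor(rep(c("A", "B"), times = c(10, 30)))

# Replicates per treatment
reps <- table(group)
reps

# Residual degrees of freedom for a one-way model: plots minus groups
df_resid <- length(group) - nlevels(group)
df_resid
```

Running the same tabulation on each site's real grouping variable would show whether any site comes close to the zero or one degree of freedom mentioned above.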

dmcglinn commented 6 years ago

Is this issue resolved now? Thanks!

dmcglinn commented 6 years ago

I can confirm that specifying log_scale = TRUE in the function get_delta_stats now does actually reduce the number of sampling points as expected. See #219 for a further update to this function to catch a corner-case bug.