BiologicalRecordsCentre / sparta

Species Presence/Absence R Trends Analyses
http://biologicalrecordscentre.github.io/sparta/index.html
MIT License

Frescalo stats output: strange location names #85

Closed: EichenbergBEF closed this issue 5 years ago

EichenbergBEF commented 5 years ago

Dear sparta developers, dear Tom (I guess), I have recently run frescalo with a very big dataset. I have 12024 locations with a lot of data (28 million entries) aggregated into 3 time periods. In total, the algorithm summary gives:

Actual numbers in data
Number of samples            12024
Number of species            3201
Number of time periods       3
Number of observations       6434890
Neighbourhood weights        1202400
Benchmark exclusions         0
Filter locations included    0

When I look at the output (specifically the Stats file), I see strange locations that all begin with an "S".

An excerpt:

 

Location | Loc_no | No_spp | Phi_in | Alpha | Wgt_n2  | Phi_out | Spnum_in | Spnum_out | Iter
8173     | 9997   | 557    | 0.655  | 1.74  | 32.01   | 0.74    | 472.4    | 603.4     | 4
8174     | 9998   | 479    | 0.665  | 1.65  | 36.20   | 0.74    | 517.8    | 643.8     | 4
8175     | 9999   | 587    | 0.751  | 0.94  | 34.41   | 0.74    | 479.0    | 467.1     | 4
8176     | 10000  | 505    | 0.906  | 0.48  | NA      | 0.74    | 495.4    | 389.6     | 32
S1       | 10001  | 0      | 0.740  | 1.00  | NA      | 0.74    | 606.7    | 606.7     | 1
S10      | 10002  | 0      | 0.740  | 1.00  | NA      | 0.74    | 554.2    | 554.2     | 1
S100     | 10003  | 0      | 0.740  | 1.00  | NA      | 0.74    | 598.1    | 598.1     | 1
S1000    | 10004  | 0      | 0.740  | 1.00  | 3837.97 | 0.74    | 533.7    | 533.7     | 1
.....

Such locations are neither in my input file nor in my neighbourhood file. However, they appear to be the numbers assigned to the species in the data I analyse.

Moreover, the numbers in the Stats file for these locations are quite odd (often NA, see above). I dug into this again and again, and found the following at the beginning of the "log" file:

Log file for FRESCALO

Input limits:
Number of samples = 10000
Number of species = 10000
Number of time periods = 100
Number of observations = 9999999
Number of neighbourhood weights = 5000000

I see that the input limit for samples is 10000, while I have a total of 12024 samples (as can be seen in the summary output above). The "problems" with the locations in the Stats file start from 10000 onwards (see the Loc_no column). Could this be the issue? Is there a way to increase the input limit for samples?

Thanks in advance

AugustT commented 5 years ago

It looks like you have done a good job of trying to work out where this issue arose. I would tend to agree that the samples limitation is causing the bug; I imagine some indexing somewhere then goes wrong and you end up with the effect you report.

"Samples" here means locations, and I have never tried to run Frescalo on such a large number of locations. You can see that the limit on the number of observations is much larger; I actually raised that one from the original version of Frescalo, which had a much lower limit and was causing me problems.

In theory this is not a hard problem to solve; we just need to increase the maximum size in Frescalo. The problem is two-fold. First, while I can see no reason to worry, I can't be sure this will not have any side effects, so you would need to watch out for any odd behaviour or results. Secondly, it requires someone to edit the original Fortran and recompile the .exe. Are you in a position to do this? I attach the code for Frescalo_3a here; the line to edit is right at the start. Mark's notes say mm is the number of samples. Once you compile it to an .exe, just use that in place of your current version.

Frescalo_3a.txt
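
Once you have a recompiled .exe, here is a rough R sketch of how it might be dropped in. The paths, column names and time periods below are placeholders, and I am assuming the frespath argument is the one that points frescalo() at the executable (check ?frescalo for your version):

# Rough sketch only: paths, column names and time periods below are
# placeholders, and frespath is assumed to be the argument that tells
# frescalo() which executable to run (check ?frescalo).
library(sparta)

my_data <- read.csv("my_records.csv")   # placeholder input data

results <- frescalo(
  Data         = my_data,
  frespath     = "C:/frescalo/frescalo_3a_big.exe",  # the recompiled exe
  time_periods = data.frame(start = c(1980, 1995, 2010),
                            end   = c(1994, 2009, 2019)),
  site_col     = "location",   # placeholder column names
  sp_col       = "species",
  year         = "year"
)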

EichenbergBEF commented 5 years ago

Thanks, Tom, very nice of you to share the code! This did the trick! We managed to compile it using gfortran, after renaming the file to *.for.

As a hint (e.g. for other users who have problems compiling the source code): check the code carefully for TAB characters, as opposed to the required 6 spaces at the beginning of lines. Sometimes the editor you use adds TABs without you noticing. Moreover, be careful not to increase the number of locations too far; this may cause your machine to run out of memory. If you really need a specific configuration, adapt the other limits (e.g. maximum number of species, number of time steps) to your needs and you may well avoid excessive memory use... or get a bigger machine ;-)
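
For anyone else going down this route, here is a rough R sketch of that pre-compile check and the gfortran call; the file names are placeholders, and gfortran is assumed to be on your PATH:

# Rough sketch only: file names are placeholders and gfortran is assumed
# to be on the PATH. This flags any TAB characters in the renamed source
# and then compiles it; fixed-form statements must start with 6 spaces.
src <- "Frescalo_3a.for"   # the attached code, renamed from .txt

tab_lines <- which(grepl("\t", readLines(src), fixed = TRUE))
if (length(tab_lines) > 0) {
  stop("TAB characters found on lines: ",
       paste(tab_lines, collapse = ", "),
       " - replace them with spaces before compiling")
}

# Compile to a new executable that can be used in place of the bundled one
system2("gfortran", args = c("-o", "frescalo_3a_big.exe", src))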

I consider this issue solved!

AugustT commented 5 years ago

@EichenbergBEF Thank you for detailing your solution; I will now close this issue.