cppisking / ffrk-inspector

Issues tracker for FFRK Inspector

Reusing old DB data #23

Closed. onefeline closed this issue 9 years ago.

onefeline commented 9 years ago

First, I want to give serious praise to this project and its founder, cppisking. This program will replace all of those needless spreadsheets that people post on reddit and will give REAL and HARD data to back up claims about the most efficient farmable places in FFRK. This is an amazing piece of work, and we should all be thankful for all of the hard work that has been done.

That being said, I've been thinking a lot about the many recent changes and the DB swaps they require when data is phased out. There are a number of issues with this, some of which have already been brought up.

1) The old data is not reused for the calculations.
2) People are not sure how to update to use the new database.
3) Frequently, the old DBs still work without throwing an error, or they produce weird values, such as ID numbers where the item name should be.

All of this can be solved by:

A) Taking down the DB when an upgrade is needed.
B) Migrating the data to new tables as needed and adding new columns as necessary.
C) MOST SIGNIFICANTLY: Preemptively grabbing all data that COULD theoretically be used by FFRK-Inspector. This includes the number of rounds per room, number of monsters seen, amount of gil seen, whether potions and/or ethers appear, total damage done to monsters, total damage done to party members, etc.
D) Bringing the DB back up without requiring a switch.

The earlier this is addressed, the more useful our data gathering will be. I myself ran Vector Streets over 33 times in the past few days and gathered lots of useful data, but it was all for naught. By invalidating the old databases, people become discouraged from continuing to collect data because they are not sure the data they collect will be preserved in a meaningful way.

I have no problem, by the way, taking on some of these changes. It might require one more DB transfer, or it might not. I love to program myself, so it would be a thrill if we could work in tandem (and take advantage of git's power of version control). However, taking steps to get this done right early on will spell future success for this project, because its usefulness is undeniable.

cppisking commented 9 years ago

> That being said, I've been thinking a lot about the many recent changes and the DB swaps they require when data is phased out. There are a number of issues with this, some of which have already been brought up.

> 1) The old data is not reused for the calculations.

Regarding this, all of the old data is being reused. You might be referring to alpha 8, where I mentioned wiping some data. Just to be clear, that was a single, brand-new column in the database that had only been collected for 24 hours. Everything that's been collected over the past week is still there and still being used. Even the drops collected yesterday weren't wiped; it was only that one column, which was separate from the "number of times this battle has been run" column and the "number of times item X has dropped" column.

So don't worry, nothing is lost. I wiped the column I did because it was mathematically impossible to convert it to the new format. But the great thing is that the new format is totally lossless. What I mean is, suppose I record information about 5 runs.

The naive approach is to store the total number of drops in one field. If a certain (battle, item) pair dropped 0, 2, 3, 1, 0 times, I would store the number 6 (total drops) and the number 5 (times run). From there you can compute the average. I've always been doing that, and I'm still doing that.

But this is a lossy encoding of the data stream. It's impossible to compute the standard deviation, for example, which is the basis of many important statistical analyses, because you no longer have a record of every individual drop.

So 2 days ago I started storing something else. In addition to the sum of all drops, I started storing the sum of squares of the drops. In the above example, I would now have another column with the value 0^2 + 2^2 + 3^2 + 1^2 + 0^2 = 14. From this I can always compute the standard deviation, and I can still compute the average because I'm still storing the simple sum too. So this opens the door to more advanced analysis.
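To make the arithmetic concrete, here is a minimal sketch (illustrative Python, not the actual FFRK Inspector code, and the variable names are not the real column names) of recovering the mean and standard deviation from just those two stored aggregates:

```python
import math

# Per-run drop counts from the example above (5 runs of one battle, one item).
drops_per_run = [0, 2, 3, 1, 0]

# The two aggregates described above; only these need to be stored.
times_run = len(drops_per_run)                       # 5
total_drops = sum(drops_per_run)                     # 0+2+3+1+0 = 6
sum_of_squares = sum(d * d for d in drops_per_run)   # 0+4+9+1+0 = 14

# Mean and (population) standard deviation recovered from the aggregates alone.
mean = total_drops / times_run                       # 6/5 = 1.2
variance = sum_of_squares / times_run - mean ** 2    # 14/5 - 1.44 = 1.36
std_dev = math.sqrt(variance)                        # ~= 1.17
print(mean, std_dev)
```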

But it's still lossy. I still can't reconstruct the entire data stream. Wouldn't it be nice if I could just store 0,2,3,1,0 in the database directly?

As of vers

cppisking commented 9 years ago

Bleh, hit send too soon. Expect more of a response coming, but I'm on mobile so typing is slow.

cppisking commented 9 years ago

As of version 8, this is what I do: for each item, it stores a histogram of the drop counts. Since a histogram is completely lossless, it's exactly as good as having the full original data stream. So there should be no need to break things like this again.
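As a rough illustration of why the histogram is lossless, here is a sketch (again illustrative Python with made-up names, not the real schema) showing that every aggregate the earlier formats stored can be recomputed from it:

```python
import math
from collections import Counter

# The same 5 example runs as before.
drops_per_run = [0, 2, 3, 1, 0]

# Histogram: "drops observed in a single run" -> "number of runs that happened in".
histogram = Counter(drops_per_run)   # {0: 2, 1: 1, 2: 1, 3: 1}

# Everything the older formats stored falls out of the histogram.
times_run = sum(histogram.values())                                     # 5
total_drops = sum(n * count for n, count in histogram.items())          # 6
sum_of_squares = sum(n * n * count for n, count in histogram.items())   # 14

mean = total_drops / times_run
std_dev = math.sqrt(sum_of_squares / times_run - mean ** 2)
print(times_run, total_drops, mean, std_dev)
```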

> 2) People are not sure how to update to use the new database.
> 3) Frequently, the old DBs still work without throwing an error, or they produce weird values, such as ID numbers where the item name should be.

You're right, and I definitely need to work on making this better. This is all part of why it's still an alpha. A tool like recordpeeker is very simple because it only has to work for you. Add a database with multiple users connecting simultaneously, plus maintenance and error handling, and it becomes pretty difficult. I'll keep working on this, though.

> C) MOST SIGNIFICANTLY: Preemptively grabbing all data that COULD theoretically be used by FFRK-Inspector. This includes the number of rounds per room, number of monsters seen, amount of gil seen, whether potions and/or ethers appear, total damage done to monsters, total damage done to party members, etc.

It actually already does this at startup. But there are some bugs where it will reload from the database and overwrite what you see even if there was an error. So again, it could be better, but it just takes a lot of time to make all this work.

> The earlier this is addressed, the more useful our data gathering will be. I myself ran Vector Streets over 33 times in the past few days and gathered lots of useful data, but it was all for naught. By invalidating the old databases, people become discouraged from continuing to collect data because they are not sure the data they collect will be preserved in a meaningful way.

Your data is still there, as I said before. :) Hopefully this post makes everything clearer.

cppisking commented 9 years ago

Just to show you that your data is still here, I ran this query against the live database.

[screenshot: query results showing Leather Glove drop data for Vector - Streets]

times_run=32 means that I have 32 records of the battle Vector - Streets. times_run_with_histogram=8 means that it's been run 8 times since I pushed alpha 8 (this is the new data format I mentioned). histo_bucket=-1 is a special value that means "the total number of drops, even from before we started using the histogram format" (because, as you mentioned, we need a way to guarantee we don't kill all the hard work everyone has done over the past week). So histo_bucket=-1, histo_value=32 means that across the times_run=32 runs since the beginning, we've observed a total of 32 Leather Gloves. Otherwise, when histo_bucket is not equal to -1, histo_bucket=n, histo_value=m means that on m occasions we have gotten n drops in the same run.

So to break it down:

Doing the math, we can see that since alpha 8, 3*1 + 1*2 + 1*4 = 9 Leather Gloves have been obtained. And 3+1+1 = 5 is the total number of times that at least 1 Leather Glove has been seen, but we've run the battle 8 times. So on 8-5=3 occasions, 0 Leather Gloves were obtained.
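For anyone following along, here is a small sketch reproducing that breakdown from the histogram rows. The bucket/value pairs are taken from the figures quoted above (3 runs with 1 drop, 1 run with 2, 1 run with 4, plus the special -1 row); the names are illustrative, not the actual table columns.

```python
# Histogram rows for (Vector - Streets, Leather Gloves) as described above.
# Bucket -1 is the special "total drops since the beginning" row; any other
# bucket b with value v means "v runs in which exactly b gloves dropped".
histo_rows = {-1: 32, 1: 3, 2: 1, 4: 1}
times_run_with_histogram = 8

drops_since_alpha_8 = sum(b * v for b, v in histo_rows.items() if b != -1)  # 3*1 + 1*2 + 1*4 = 9
runs_with_a_drop = sum(v for b, v in histo_rows.items() if b != -1)         # 3 + 1 + 1 = 5
runs_with_no_drop = times_run_with_histogram - runs_with_a_drop             # 8 - 5 = 3
total_drops_ever = histo_rows[-1]                                           # 32, including pre-alpha-8 runs

print(drops_since_alpha_8, runs_with_a_drop, runs_with_no_drop, total_drops_ever)
```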