Tidy up dataframe creation

SylviaWhittle commented 1 year ago

While refactoring processing.py, it seems to me that we could simplify our handling of results dataframes and in turn reduce errors and unexpected indexes. Currently we have many different points where we create empty dataframes, add indexes and reset indexes. I haven't looked into this extensively yet, but it seems we may be able to simplify this to having just one dataframe being created with all the necessary indexes have that dataframe appended to as opposed to constant merging of many different dataframes.

For example, as I am editing code, I am finding that DNAtracing is crashing when there are no grains, because after tracing, there is no 'image' column in the results dataframe. This is because the 'image' column only gets added inside grainstats, which is an issue if grainstats never completes, or is not run.

I could well be missing some intricacies though!

ns-rse commented 1 year ago

That is a very late night/early morning investigation!

It is certainly possible but may run into issues of concurrency.

The reason empty dataframes are created at the moment, at least at some stages, is because some stages fail but we need to ultimately return a dataframe, from each process_scan() even if its empty because ultimately these are all pd.concat() into a single data frame.

The situation you describe shouldn't arise I don't think because there is no point attempting to run dnatracing if no grains have been found (I've just checked and we only run if dnatracing_config["run"]: and should probably have an additional check to ensure that at least one of grainstats["direction"].shape[0] > 1 (i.e. at least one grain has been found).

If image column is missing it can be added to the ALL_STATISTICS_COLUMN defined in utils.py although I've just checked and it should be the first column (the index is set to molecule_number so not sure where its going) so not sure why its not making it through (line 300 in grainstats.py adds the image column to an existing dataframe with statistics).

Anyway, this is a long winded way of saying that we would probably hit problems if the empty dataframe were created before processing any files and were passed in as an argument to process_scan() for use with GrainStats because you would then have the issue where two images might want to add rows to the dataframe at the same time (I believe this is where things like the Global Interpreter Lock come into play). If a dataframe is to be created up front, I think the best spot to do so would be on instantiating GrainStats and then on lines 268-293 the statistics can be appended.

However, on writing that we would still have the problem that if GrainStats failed there would be no pd.DataFrame to return and include in pd.concat() (although a quick check with pd.concat([None, pd.DataFrame({"a": [1, 2], "b", [3, 4]})]) suggests this shouldn't be a problem.

Hmm, I think I'm :confused: now and would have to look at this in detail and try things out to offer any more thoughts!

MaxGamill-Sheffield commented 1 week ago

I think this has been sorted with the better tracing merger (well... still not tidy but the errors when there's nothing to skeletonise, nodestats, trace or spline are gone). How do people feel about closing this issue?

AFM-SPM / TopoStats

Tidy up dataframe creation #631