[FEAT EXTRACTION] Reproducibility of GLSZM features

AIM-Harvard / pyradiomics

Open-source python package for the extraction of Radiomics features from 2D and 3D images and binary masks. Support: https://discourse.slicer.org/c/community/radiomics

http://pyradiomics.readthedocs.io/

BSD 3-Clause "New" or "Revised" License

[FEAT EXTRACTION] Reproducibility of GLSZM features #792

Open asmaharry opened 1 year ago

asmaharry commented 1 year ago

We observed differences in GLSZM features between two files. File1 contains features that were extracted 6 months ago; File2 contains recently extracted features. Both files were generated on the same system. When I compared File1 with File2, the maximum difference was a six-digit number.

I then restarted my system and extracted the features for File2 again. After that, the maximum difference between File1 and File2 was minimal. In other words, the radiomics features changed after restarting the system.

Has anyone noticed the same problem, or does anyone have an idea of what is happening here? Why am I not able to reproduce the texture features for the same dataset? @JoostJM

Thanks.

JoostJM commented 1 year ago

On what system are the files generated? How do you compare them? Is there a difference in PyRadiomics versions? PyRadiomics includes many tests that prevent the calculated feature output from changing accidentally. When feature output changes due to bug fixes, the baseline is updated, and this is logged in the changelog.

In the past I have noticed some users opening the output file (CSV) in Excel with the wrong region setting. The output culture in PyRadiomics is en-US, with "." as the decimal symbol; when the file is opened in Excel with "," as the decimal symbol, the value gets transformed into a large integer. Is it possible this occurred in your case?
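To illustrate the decimal-symbol mix-up described above, here is a minimal sketch in plain Python. The two helper functions are hypothetical and only mimic how a reader with en-US versus decimal-comma conventions might interpret the same text; they are not part of PyRadiomics or Excel, and the sample value is made up.

```python
def parse_en_us(text: str) -> float:
    """Interpret '.' as the decimal symbol (the convention of the output file)."""
    return float(text)


def parse_decimal_comma(text: str) -> float:
    """Hypothetical reader that treats '.' as a thousands separator
    and ',' as the decimal symbol."""
    return float(text.replace(".", "").replace(",", "."))


raw = "0.123456"  # made-up feature value as it would appear in the CSV text

intended = parse_en_us(raw)         # the value that was written
misread = parse_decimal_comma(raw)  # the '.' is dropped, leaving a large integer

print(intended, misread, abs(misread - intended))
```

Under these assumptions a small feature value turns into a six-digit number, which matches the size of the error reported at the top of the thread.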

asmaharry commented 1 year ago

Thank you for the response. I am using an Ubuntu system and extract the features with the same version of the software (pyradiomics). The extracted features are saved in CSV format (new_X_df is the dataframe that contains the features):

    csv_filename = os.path.join(PathToCSVs, filename)
    new_X_df.to_csv(csv_filename, index=False)

I then read each file back like this:

    data = pd.read_csv(filepath)
    X1 = data.to_numpy()

and compare the two numpy arrays (X1 and X2).

Please let me know if you see any problem here. I also observed that the features that cause this problem are wavelet-based features. Many thanks.
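A comparison along these lines could be sketched as follows, to see which feature columns carry the large differences. This is only an illustration: the helper name `max_abs_diff_per_feature` is hypothetical, the feature names and values are made up, and it assumes both tables list the same cases in the same row order.

```python
import numpy as np
import pandas as pd


def max_abs_diff_per_feature(df_a: pd.DataFrame, df_b: pd.DataFrame) -> pd.Series:
    """Largest absolute difference per shared numeric column, biggest first."""
    shared = [col for col in df_a.columns if col in df_b.columns]
    a = df_a[shared].select_dtypes(include="number")
    b = df_b[a.columns]
    diff = np.abs(a.to_numpy(dtype=float) - b.to_numpy(dtype=float))
    result = pd.Series(diff.max(axis=0), index=a.columns)
    return result.sort_values(ascending=False)


# Made-up stand-ins for the two feature tables (File1 and File2).
file1 = pd.DataFrame({
    "wavelet-LLH_glszm_ZoneVariance": [1.5, 2.5],
    "original_glszm_ZonePercentage": [0.25, 0.5],
})
file2 = pd.DataFrame({
    "wavelet-LLH_glszm_ZoneVariance": [1.5, 250000.5],
    "original_glszm_ZonePercentage": [0.25, 0.5],
})

worst = max_abs_diff_per_feature(file1, file2)
print(worst)
```

Sorting by the per-column maximum makes it easy to check whether the discrepancies are confined to a feature family, such as the wavelet-based features.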

JoostJM commented 1 year ago

The part where you go to CSV and then back means your values are converted to strings and are therefore subject to the current culture. I suspect your error is occurring there. What happens if you save and load using pickle?
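A minimal sketch of that check, assuming the features are held in a pandas DataFrame: it writes the same table once with pickle and once as CSV, reloads both, and compares each against the in-memory original. The column names, values, and file names are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up feature table standing in for new_X_df.
features = pd.DataFrame({
    "feature_a": [0.1, 1.0 / 3.0, 1e-17],
    "feature_b": [123456.789, 2.0 ** 0.5, 3.0],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "features.pkl")
    csv_path = os.path.join(tmp, "features.csv")

    # Binary round trip: values are stored as-is, with no text conversion.
    features.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

    # Text round trip: values are written as strings and parsed back.
    features.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

pickle_identical = np.array_equal(features.to_numpy(), from_pickle.to_numpy())
csv_max_diff = np.max(np.abs(features.to_numpy() - from_csv.to_numpy()))

print("pickle round trip identical:", pickle_identical)
print("largest CSV round-trip difference:", csv_max_diff)
```

If the pickle round trip matches exactly while the CSV one shows large differences, the save and load step is the likely culprit; if both match, the discrepancy probably arises earlier, during extraction.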