ESHackathon / CiteSource

http://www.eshackathon.org/CiteSource/
GNU General Public License v3.0

Bar chart inaccurate #149

Closed: TNRiley closed 1 year ago

TNRiley commented 1 year ago

See CAB for a simple example. The bar chart shows only 2 records included at final (both unique); however, the table shows 4, and the record-level table confirms that there are 2 unique records and 2 duplicate records at final.

image image image

LukasWallrich commented 1 year ago

I would expect that the bar chart shows 2 twice - just on top of each other. Where is this data?

I will try to shift the labels. We will often have very small bars, so the labels should sit above and below the bars rather than within them.

LukasWallrich commented 1 year ago

I pushed adjusted code - pls check if that fixes your issue.

Actually, I didn't like the look of labels on top/bottom:

image

Instead, I propose keeping them in the middle, but with a minimum distance from the x-axis, like this

image

In any case, I hope this was the issue you faced.
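The minimum-distance idea above can be sketched roughly as follows. This is only an illustration in Python, not the code that was pushed: the function name `label_position` and the `min_offset` parameter are hypothetical, and it assumes each bar starts at the x-axis and may extend to either side of it.

```python
import math


def label_position(bar_height, min_offset):
    """Return the y-coordinate for a label placed in the middle of a bar.

    Hypothetical helper: the label sits at the bar's midpoint, unless that
    midpoint is closer to the x-axis than min_offset, in which case it is
    pushed out to min_offset on the bar's side of the axis.
    """
    midpoint = bar_height / 2
    if abs(midpoint) < min_offset:
        # Very small bar: keep the label a fixed distance from the axis.
        return math.copysign(min_offset, bar_height)
    return midpoint


# A tall bar keeps its label in the middle; a tiny bar gets the minimum offset.
tall = label_position(10, 0.5)
tiny = label_position(0.4, 0.5)
tiny_below = label_position(-0.4, 0.5)
```

With this rule, labels on tall bars stay centred, while labels on very small bars no longer collide with the axis.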

TNRiley commented 1 year ago

@LukasWallrich The number location looks much better. However, I'm now seeing 1 unique and 2 duplicate for CAB, so one record is missing. The table also shows only 3 for the final included. This must be due to the other changes you pushed, since they are impacting the table as well.

image image

TNRiley commented 1 year ago

I confirmed that the changes to read_citations were the issue. With only_key_fields = FALSE you get the full 4 back. I can take a look to see which element we need to add back as a key field.

image

TNRiley commented 1 year ago

@kaitlynhair - Lukas added the following as the key fields. I don't see anywhere that ASySD would be considering any further fields than these in the deduplication process. Can you tell why stripping records of all fields other than these would cause a difference in results? Is type considered (journal, book, etc.)?

"author", "title", "year", "journal", "abstract", "doi", "number", "pages", "volume", "isbn", "record_id", "label", "source", "issue", "url"

TNRiley commented 1 year ago

Had a chance to look at things a bit more. Comparing the unique_citations data from each run, I've found some differences. When only_key_fields is TRUE, it appears that pages, start page, and ISSN are stripped of their content. Of the three, only the "pages" field is used in identifying dups, so I'm guessing that is the problem. However, I'm not sure why the changes Lukas implemented would be stripping that metadata. Here is a screenshot of records sorted by title so rows align.

image

LukasWallrich commented 1 year ago

Thanks for digging into this - super helpful!

I think the issue is that I only considered the fields ASySD uses internally - not the ones that are alternatively merged into them. ISBN is made up of ISBN and ISSN, pages is made up of pages and start_page etc. I have now added them to the key fields - could you try to see if that resolves it?
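To illustrate the effect described above, here is a minimal sketch in Python (not the actual CiteSource or ASySD code). The field names come from this thread; the helper functions, the `FALLBACKS` mapping, and the "fill in when empty" merge rule are assumptions made for the example, and only the two merged pairs named above are modelled.

```python
# Key fields as originally listed in this thread.
KEY_FIELDS = {
    "author", "title", "year", "journal", "abstract", "doi", "number",
    "pages", "volume", "isbn", "record_id", "label", "source", "issue", "url",
}

# Assumed merge rule: the target field falls back to the alternative field.
FALLBACKS = {"isbn": "issn", "pages": "start_page"}


def keep_fields(record, fields):
    """Drop every field that is not in the given set."""
    return {key: value for key, value in record.items() if key in fields}


def merge_fallbacks(record):
    """Fill empty target fields from their fallback fields, if present."""
    merged = dict(record)
    for target, fallback in FALLBACKS.items():
        if not merged.get(target):
            merged[target] = merged.get(fallback)
    return merged


# Made-up record that only carries the alternative fields.
record = {"title": "Example", "pages": None, "start_page": "101",
          "isbn": None, "issn": "1234-5678"}

# Stripping to the original key fields loses start_page and issn before the merge.
stripped = merge_fallbacks(keep_fields(record, KEY_FIELDS))

# Keeping the fallback fields as well preserves the information.
fixed = merge_fallbacks(keep_fields(record, KEY_FIELDS | set(FALLBACKS.values())))
```

In this sketch `stripped` ends up with empty `pages` and `isbn`, while `fixed` recovers both values, which matches the difference seen between the two runs.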

TNRiley commented 1 year ago

Awesome, looks good on my end. I hadn't caught that those were merged fields to begin with. I'd be interested in why @kaitlynhair decided on that.