GoogleCodeExporter opened this issue 8 years ago
Iain, I can't import the project attached. It appears to be corrupted. Could
you export just the data?
Original comment by dfhu...@google.com
on 28 Sep 2010 at 4:46
Attached is the original data file I imported.
ClinicalTrials.gov only outputs individual XML files (see issue 131), so I
had an external script stitch them into a single XML file.
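The external stitching script isn't attached to the issue, but the idea can be sketched in a few lines. This is a hypothetical reconstruction, not the actual script: the root tag name and input format are assumptions for illustration.

```python
# Hypothetical sketch of a stitching script: wrap a set of per-trial
# ClinicalTrials.gov XML documents under one shared root element.
# The root tag "clinical_studies" is an assumption, not from the issue.
import xml.etree.ElementTree as ET

def stitch(xml_strings, root_tag="clinical_studies"):
    """Combine several standalone XML documents under a single root element."""
    root = ET.Element(root_tag)
    for doc in xml_strings:
        root.append(ET.fromstring(doc))
    return ET.tostring(root, encoding="unicode")
```

The real script would read the individual files from disk; here the documents are passed in as strings to keep the sketch self-contained.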
Original comment by iainsproat
on 28 Sep 2010 at 8:20
Attachments:
I can repro the bug, although I haven't figured out how to fix it. Exporting
the data as an HTML table works--all the records seem to be there. But inside
Refine, only the first 2 records are shown. I guess the JSON stream got cut
off? Which is weird, because that would cause a syntax error.
Original comment by dfhu...@google.com
on 28 Sep 2010 at 7:18
Attached is a smaller version of the file which should make debugging a little
easier.
Part of the problem was that the multi-line text elements were confusing the
parser, but I've just committed a change that should fix that piece. It's
still splitting a single record into multiple rows when it shouldn't. I'll
take a look and see why.
As an aside, I think the reason the browser goes berserk is that it's trying to
deal with a large number of null cells. Fixing the project import will
probably solve this, but someone with stronger JavaScript fu than I have might
want to take a look at whether there's another problem lurking there.
Original comment by tfmorris
on 27 Nov 2010 at 10:14
Attachments:
I think I've figured out what the problem is here. The column groups for this
XML fragment are getting computed incorrectly:
<sponsors>
<lead_sponsor>
<agency>National Center for Research Resources (NCRR)</agency>
<agency_class>NIH</agency_class>
</lead_sponsor>
<collaborator>
<agency>HRSA/Maternal and Child Health Bureau</agency>
<agency_class>U.S. Fed</agency_class>
</collaborator>
</sponsors>
Refine is currently computing three column groups from this: lead_sponsor (2
columns), collaborator (2 columns), and sponsors (all 4 columns). This last
group triggers unnecessary row dependencies when there is no collaborator
element.
I'm tempted to say that column groups which consist of nothing but other
groups, without any individual ungrouped columns of their own, should be
eliminated, but this is a fairly critical piece of code, so I want to look at
it a little more closely.
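The proposed elimination could look something like the following Python sketch. Refine's importer is Java (the relevant class is ImportColumnGroup), so the dict-based structure and function name here are purely illustrative assumptions:

```python
# Hypothetical sketch of the proposed fix: drop column groups that contain
# only subgroups (like the synthetic "sponsors" group above) and promote
# their children. Data shape and names are assumptions, not Refine's API.

def prune_empty_groups(groups):
    """Return the group list with wrapper-only groups replaced by their subgroups."""
    result = []
    for g in groups:
        subs = prune_empty_groups(g.get("subgroups", []))
        if g.get("columns"):
            # The group owns real columns of its own: keep it (with pruned children).
            result.append(dict(g, subgroups=subs))
        else:
            # Wrapper-only group: lift its children up a level.
            result.extend(subs)
    return result
```

Applied to the sponsors example, this would leave lead_sponsor and collaborator as independent groups and drop the enclosing sponsors group from the dependency analysis.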
p.s. I suspect the reason for the sluggish browser performance is that the
browser was trying to deal with the entire file at once, since the monster
second record effectively disabled the paging.
Original comment by tfmorris
on 14 Oct 2011 at 6:49
Attachments:
Reversing course, I now think that perhaps we should be computing row->record
dependencies directly, since we have that information from the parse. We know
what the path to the top-level element is and, by definition, all rows that we
create during the parse of its children are part of the same record.
Visually the column groups look like this:
Col #          2222333344445555
Sponsor group  SSSSSSSSSSSSSSSS
Lead/Collab    CCCCccccLLLLllll
The three groups are group S, group Cc, and group Ll. Note that the columns have
been reordered with the Collaborator columns before the Lead columns. Group S
is the problematic one. Since the Collaborator sponsors are optional, that
means that the "key" column #2 can be blank for a top level record, causing it
to get merged with the previous record (the algorithm searches back for a row
with a non-blank "key" cell if any of the cells in the column group are
non-blank).
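The merging behaviour described above can be condensed into a toy sketch. This is not Refine's actual implementation (which lives in the Java tree-importer classes); it just mirrors the rule that a blank "key" cell folds a row into the previous record:

```python
# Toy model of the record-grouping rule described above: a row starts a new
# record only when its "key" cell is non-blank; otherwise it is folded into
# the record begun by the nearest earlier row with a non-blank key.

def group_rows(rows, key_col):
    """Group rows into records based on whether the key column is populated."""
    records = []
    for row in rows:
        if row[key_col] not in (None, "") or not records:
            records.append([row])        # non-blank key: start a new record
        else:
            records[-1].append(row)      # blank key: continuation row
    return records
```

With this rule, a record whose key column happens to be blank (e.g. because the optional Collaborator columns supply the key) gets silently merged into the record before it, which is exactly the bug.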
Group S is a real column group (consisting solely of other column groups), but
since we don't appear to use column groups for anything, I'm not sure what
value it has.
I think there are two options here:
1. Eliminate column groups which consist solely of other column groups from the
dependency analysis
2. Compute row dependencies using a different method than column groups (e.g.
use the tree structure directly from the parse).
Opinions?
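Option 2 could look roughly like the following. Python's stdlib parser stands in here purely for illustration; Refine's importer is Java, and the function name is made up:

```python
# Hedged sketch of option 2: take record boundaries straight from the parse
# rather than inferring them afterwards from column groups. Every subtree
# rooted at the known record element is, by definition, one record.
import io
import xml.etree.ElementTree as ET

def split_records(xml_text, record_tag):
    """Return one subtree per top-level record element found during parsing."""
    return [elem for _event, elem in
            ET.iterparse(io.StringIO(xml_text), events=("end",))
            if elem.tag == record_tag]
```

This assumes the record element does not also appear nested inside itself; since we know the path to the top-level element, the real implementation could track the element path to be exact.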
Original comment by tfmorris
on 27 Oct 2011 at 12:04
This is fixed, at least well enough for this example, with r2347. The cell
data counts weren't getting updated before sorting the column groups, causing
them all to be zero, which meant data-rich columns weren't getting placed
where Refine needed them to be for the key column.
I think there's still an underlying issue where the current counting strategy
can get fooled into choosing the wrong column. For example, if Column A is
mandatory but never has more than a single value, while Column B's values are
optional but high frequency (e.g. present in only 50% of the records, but with
3 values per record, giving it 1.5x as many cells as Column A), then Column B
will be chosen in preference to Column A.
There's also the case where no single column in a column group always has a
value (e.g. a column group where either column A OR column B is populated for
any given record).
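The first failure mode is easy to demonstrate with toy numbers (the figures below are hypothetical, matching the 50%/3-values example above, and the heuristic is a simplification of what Refine actually does):

```python
# Toy illustration of the counting pitfall described above: choosing the key
# column purely by total populated-cell count prefers an optional,
# multi-valued column over a mandatory single-valued one.

def pick_key_by_cell_count(cell_counts):
    """Naive heuristic: the column with the most populated cells wins."""
    return max(cell_counts, key=cell_counts.get)

# 100 records: Column A has exactly one value per record (100 cells);
# Column B appears in only 50% of records but with 3 values each (150 cells).
counts = {"A": 100, "B": 150}
```

Here the heuristic picks B, even though A is the only column guaranteed to be populated for every record and is therefore the correct key.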
Original comment by tfmorris
on 28 Oct 2011 at 7:56
Reopening. The theoretical issue that I suspected has been confirmed in the
wild by a new XML file received from a commenter on issue 393, so we're going
to need a different approach to dealing with column groups.
Original comment by tfmorris
on 6 Nov 2011 at 9:14
Issue 393 has been merged into this issue.
Original comment by tfmorris
on 6 Nov 2011 at 9:26
Bump all unfinished 2.5 Milestone tags to next release
Original comment by tfmorris
on 12 Dec 2011 at 7:56
[deleted comment]
[deleted comment]
Hello all,
Outside user, working to clean up someone else's messy server. After importing
three .csv files that total 5019 records in length, I delete the 'file'
column as it is not needed. The record count then drops to 1333. I do not
have any facets selected. When attempting to click around and discover the
source of the problem, Firefox tells me that I have an unresponsive script:
http://127.0.0.1:3333/project-bundle.js:7973
and asks me if I wish to continue or not. If I continue, it will eventually
ask me again. If I stop the script, it crashes the tab I am running Refine
in.
I converted my original .csv files into .tsv format and reopened them in
Refine. This solved my issue.
Quirky little bug with a simple workaround for those of you out there in
Userland who are not quite up to speed on programming languages.
Original comment by emilytgriffiths
on 13 Apr 2012 at 2:56
Since you're working with CSV, not XML, it's clearly not related to this bug.
Please feel free to post on Google Refine mailing list if you'd like
assistance. It sounds like you probably had a mostly blank initial column in
one or more of your CSVs. You can either process it in "row" mode (instead of
"record" mode) or shuffle the columns around so it's not an issue.
Original comment by tfmorris
on 13 Apr 2012 at 4:58
Could someone provide a pointer on how exactly the record detection works?
Does it happen at import time or at runtime? Where can I find the code
responsible for this? When I provide a properly structured XML file, it merges
some of the records into one, but I can't see any features that distinguish
those records from the others... I'd like to look at the source of the
problem, but currently have no idea where to look.
I'm attaching the offending xml.
Thanks
Original comment by libo...@gmail.com
on 11 Jun 2012 at 11:24
Attachments:
Clicking the rev above that I used to repair part of the problem (r2347) will
get you in the right ballpark. The XML and JSON importers are in the
"tree-shaped" importer family.
ImportColumnGroup is the class which manages the column groups that are used to
determine dependent rows/records. Anything that references it is probably
involved. TreeImportUtilities and XmlImportUtilities have methods which are
used with this.
Hope that helps get you started. Let us know on the dev list if you have any
questions.
Original comment by tfmorris
on 12 Jun 2012 at 8:15
Original issue reported on code.google.com by
iainsproat
on 22 Sep 2010 at 10:02
Attachments: