adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+
GNU General Public License v3.0

Added two fields page_count and page_range. #161

Closed golnazads closed 6 years ago

golnazads commented 7 years ago

utils/readrecords.py — modified PROJECT_HOME, which was going up too many levels; commented out read_records.INIT_LOOKERS_CACHE(), which was causing a NoneType error; removed the underscore beside id in the line r[_id] = i.
aip/libs/enforce_schema.py — renamed pagenumber to number_pages to match adspy.
aip/libs/solr_adapter.py — added page_count and page_range.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.2%) to 69.259% when pulling 321e0f092b97d11fbe32230c9203833854b68452 on golnazads:master into c741c5c47303cea0434f4a9196eba7efb4bb252b on adsabs:master.

romanchyla commented 7 years ago

Is the name agreed? Because I very much prefer 'numpages' instead of the longer and harder-to-type 'page_count' -- 'numpages' is also a standard shorthand.

romanchyla commented 7 years ago

This is only a warning message, not an error - please check it again.

On Fri, Sep 15, 2017 at 12:29 PM, golnazads notifications@github.com wrote:

@golnazads commented on this pull request.

In run.py https://github.com/adsabs/ADSimportpipeline/pull/161#discussion_r139193682 :

@@ -174,8 +174,8 @@ def main(*args):

args = parser.parse_args()
if args.bibcodes:
    args.bibcodes = [x.strip() for x in args.bibcodes.split(',')]
if args.bibcodes:

Steve should be able to explain this in detail. But I just took the comment out and I get this error:

Sorry, cant find the proj home; returning the location of the caller: /home/gshapuri/code/ADSimportpipeline/python/lib/python2.7/site-packages/adsputils

and then it hangs. I checked the /proj directory and it is accessible.

— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://github.com/adsabs/ADSimportpipeline/pull/161#discussion_r139193682, or mute the thread https://github.com/notifications/unsubscribe-auth/AAZIknXk8TTqmZ609UFXHguO64P_eTjzks5siqXOgaJpZM4PGITT .
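For context, the kind of upward directory search that produces a warning like the one above can be sketched as follows. This is a hypothetical illustration: `find_proj_home` and the `requirements.txt` marker file are assumptions, not the actual adsputils implementation.

```python
import os

def find_proj_home(start=None):
    # Walk upward from the starting directory looking for a marker file
    # (requirements.txt here); if nothing is found before the filesystem
    # root, warn and fall back to the caller's location, as in the
    # message quoted above.
    d = start or os.getcwd()
    while d != os.path.dirname(d):
        if os.path.exists(os.path.join(d, 'requirements.txt')):
            return d
        d = os.path.dirname(d)
    print('Sorry, cant find the proj home; returning the location of the caller:',
          start or os.getcwd())
    return start or os.getcwd()
```

A mis-counted number of `../` levels in such a search is exactly the sort of bug the PROJECT_HOME change mentioned earlier addresses.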

romanchyla commented 6 years ago

Hi @golnazads and @aaccomazzi : this PR is getting more complicated every day!

I'm reviewing the rules for pages and the solr schema and this is what I'm seeing:

we already have the field "page" - it is a multivalued field; if you send it an array of ["20", "50-55"], it will index the first values (i.e. ["20", "50"]). This hack was put in place before we had python transformers; now I don't see a need for it - the work can be done on the python side. By the same token, I don't see a need to create a new field and send the same values (i.e. page_range, with just a different format, i.e. '50-55').

I also see inconsistency in the stubdata - sometimes the page value is None, sometimes a single string '50', sometimes a list of strings ['50'] - and I don't see any checking of the type. This is very important, because if we send SOLR the wrong data, the document will be rejected.

SO, this is what I NEED you to do, Golnaz:

and DECIDE if we want to have last_page and first_page instead + decide if we need page_range (because you can generate a 'page' field that effectively has the range, i.e. page: ['50', '80']; do we really need to duplicate what we store already?)

In my opinion, these fields are not important for search, so I'd recommend just generating page:[first, last] + page_count.
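The python-side normalization recommended here could look something like this sketch (a hypothetical helper, not code from this PR; the field-name choices follow the discussion above):

```python
def normalize_page_fields(raw_page):
    # Split a raw value like '50-55' into page=[first, last] and derive
    # page_count; page_count stays None unless both endpoints are plain
    # integers, so solr never receives a non-integer for that field.
    if raw_page is None:
        return {'page': [], 'page_count': None}
    parts = [p.strip() for p in str(raw_page).split('-') if p.strip()]
    result = {'page': parts, 'page_count': None}
    if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
        result['page_count'] = int(parts[1]) - int(parts[0]) + 1
    return result
```

For example, '50-55' would yield page ['50', '55'] and page_count 6, while a roman-numeral range like 'iv-xx' would leave page_count as None.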

Thanks!

golnazads commented 6 years ago

we already have field "page"

I have not touched the definition of page; I have only made sure we are extracting a page value out of the raw_pub, if there is one. I modified only the regular expression side, to make sure we have covered every possible case.

it is a multivalued field, if you send it an array of ["20", "50-55"] it will index the first values (i.e. ["20", "50"])

Yes, I understand, and again I did not touch this part either. page always grabbed the starting page and I left that alone. I also understand that page can contain multiple values, and that one of the values can be an eid; again, I left that as it was.

I don't see a need to create a new field and send the same values

Again, I did not add page_range; it was already there, this is not my addition: 'page_range': '', // somewhat redundant, but there are weird cases (e.g. STI) where it's not just "firstpage-lastpage"

None, single string '50', or list of strings ['50']

Yes, I return None if nothing is found, or a page number. I don't think I have ever seen a list returned! Oh wait, I made that modification on the adspy side. And yes, I have seen duplicate code to grab page in ImportPipeline; I did not touch that either. All the work for extracting page on my part was done on the adspy side. My modification in ImportPipeline was renaming the field page_number to number_pages, and adding page_count.

aip/libs/enforce_schema.py — renamed pagenumber to number_pages to match adspy.
aip/libs/solr_adapter.py — added page_count and page_range.

golnaz


golnazads commented 6 years ago

I guess I contradicted myself! In the comment section I said I added page_range, but that is wrong: I did not add page_range. I don't know why I said that.

sorry golnaz


romanchyla commented 6 years ago

I have to decide whether the changes fit with the existing ones (sometimes adding something when a field already exists is not the right thing to do). So in this case we need you to concentrate on the questions that were posed in the last section of the comment. Thanks.

golnazads commented 6 years ago

page_count did not exist in Solr; that is what I added.

We had page_number, which never made it to Solr. I thought the name was confusing, so I renamed page_number to number_pages. Now I am wondering why I did not rename it to page_count, which would be consistent with the Solr name.

golnaz


golnazads commented 6 years ago

So sorry, I was right in the comment section: I did add page_range to Solr. As you mentioned, page always grabs the starting page, and we were already filling the page_range variable in the code; Alberto reminded me that some of the export formats require page_range, and so I added it in.

So sorry for the confusion at first on my part.

golnaz


aaccomazzi commented 6 years ago

For the record, the reason for the different fields is the following:

page - multivalued/string/searchable - should contain the first page as well as any electronic ids; you are correct that we should modify the python code to remove the "-" and the last page, since the only searchable content we are interested in is the start page number and the electronic id. Useful for search and for output of the first page.

page_range - single valued/string - a string denoting a human-readable range of pages, such as "5-10, 15, ix-xvii, p.1". Useful for output formats.

page_count - integer/searchable - the number of pages in the document; it is called page_count rather than 'numpages' for consistency with the other fields (citation_count, read_count, author_count, etc.). Useful both in output formats and for sorting records.

golnazads commented 6 years ago

if you send it an array of ["20", "50-55"] it will index the first values (i.e. ["20", "50"])

I actually do not think we send Solr a range. This is from ADSCachedExports.py; it seems we separate the fields here and assign values to page and lastpage:

if 'electronic_id' not in publication_parsed_data:
    try:
        start_page_jp = publication_parsed_data['page_range'].split('-')[0]
        xml_node(parent, 'page', start_page_jp)
    except:
        if start_page:
            xml_node(parent, 'page', start_page)
    try:
        last_page_jp = publication_parsed_data['page_range'].split('-')[1]
        xml_node(parent, 'lastpage', last_page_jp)
    except:
        if last_page:
            xml_node(parent, 'lastpage', last_page)

On Mon, Sep 18, 2017 at 7:23 PM, Shapurian, Golnaz <golnaz.shapurian@cfa.harvard.edu> wrote:

page - multivalued/string/searchable - should contain the first page as well as any electronic ids ...

I made sure dashes are not returned from adspy. If the starting page is missing but the last page is available, we get an extra "-" that needs to be removed (e.g., "Aviatsiya i kosmonavtika, No. 9, p. -35, 40 - 41" returns "-35,"); the same is true in reverse, when the first page is followed by "-" but no last page (e.g., "Science, Volume 334, Issue 6057, pp. 737- (2011)." returns "737-"), so the trailing "-" has to be removed as well. The hybrid regex is there to match hybrid page number formats. Here is the code:

page = page.strip(' ').strip('-').strip('+').strip(',').strip('.').strip(';').replace('[', '').replace(']', '')
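The strip chain above can be checked against the two examples mentioned. Here is a sketch wrapping the quoted one-liner (`clean_page` is a hypothetical name, not a function from adspy):

```python
def clean_page(page):
    # The quoted cleanup chain: remove stray separators that leak in
    # from raw publication strings, e.g. '-35,' -> '35' and '737-' -> '737'.
    return (page.strip(' ').strip('-').strip('+').strip(',')
                .strip('.').strip(';').replace('[', '').replace(']', ''))
```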

romanchyla commented 6 years ago

OK, so I have made those edits to the solr schema:

page - multivalued string field, searchable, stored
page_range - single valued string field, only stored
page_count - integer, single value, searchable, stored

I'll also remove from solr the logic that was modifying the values in the 'page' field (we should do the cleanup/normalization on the python side).

page_count is the one that might cause the most problems; solr has to receive an integer (so what happens for values like 'iv-xx'?)
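In solr schema.xml terms, the three edits described above would correspond to field definitions along these lines (a sketch of standard solr syntax, not the actual ADS schema file; the type names vary by schema):

```xml
<field name="page"       type="string" indexed="true"  stored="true" multiValued="true"/>
<field name="page_range" type="string" indexed="false" stored="true"/>
<field name="page_count" type="int"    indexed="true"  stored="true"/>
```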

romanchyla commented 6 years ago

@golnazads - could you please also review the 'eid' and 'year' fields? The same logic was applied to them on the solr side (split by '-' and index only the first token), so now we should do the work on the python side.

golnazads commented 6 years ago

the page_count is the one that might cause most problems; solr has to receive an integer (so what happens for values as 'iv-xx'?)

page_count is computed only if both the first and the last page are numbers.
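That rule can be written down as a small guard (a hypothetical sketch; `page_count_or_none` is not a function from this PR):

```python
def page_count_or_none(first_page, last_page):
    # Only compute a count when both endpoints are plain integers;
    # anything like 'iv'/'xx' leaves page_count unset, so solr never
    # receives a non-integer value for an integer field.
    if first_page and last_page and first_page.isdigit() and last_page.isdigit():
        return int(last_page) - int(first_page) + 1
    return None
```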


golnazads commented 6 years ago

could you please review also 'eid' and 'year' field?

eid can be something like 02B701-02B701-3, and I don't think we want this to be broken down. I could be wrong though; Alberto or Carolyn should clarify this.
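To illustrate the concern, the old solr-side logic (split by '-', keep the first token) applied to such an eid would discard most of the identifier:

```python
eid = '02B701-02B701-3'
# what the old split-by-'-' logic would have kept
first_token = eid.split('-')[0]
print(first_token)  # prints '02B701', losing the rest of the identifier
```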


aaccomazzi commented 6 years ago

eid should be indexed as a single-valued string, searchable. page_count is an integer by definition.

golnazads commented 6 years ago

Started from the latest master repo a87c567 and made the following changes:
aip/libs/enforce_schema.py — renamed pagenumber to number_pages to match adspy.
aip/libs/solr_adapter.py — added page_count and page_range.
tests/stubdata/stubdata.py — added page_count to stubdata.
tests/tests/test_solr_adapter.py — added page_count.

coveralls commented 6 years ago


Changes Unknown when pulling cfe8e3a522ba2ea759835d35838562e19a938769 on golnazads:master into on adsabs:master.