atulkum / pointer_summarizer

pytorch implementation of "Get To The Point: Summarization with Pointer-Generator Networks"
Apache License 2.0

Decode custom texts with a pretrained model #38

Open timoderbeste opened 5 years ago

timoderbeste commented 5 years ago

I have some texts that I would like to summarize using this model. Each text is stored in a JSON file with the following format: the id and abstract fields are empty, and the article field is an array of tokenized sentences.

{"id": "", "abstract": [], "article": ["Hyperconvergence has come a long way in a relatively short time , and enterprises are taking advantage of the new capabilities .", "Hyperconverged infrastructure ( HCI ) combines storage , computing and networking into a single system ; hyperconverged platforms include a hypervisor for virtualized computing , software-defined storage , and virtualized networking .", "HCI platforms were initially aimed at virtual desktop infrastructure ( VDI ) , video storage , and other discrete workloads with predictable resource requirements .", "Over time , they have advanced to become suitable platforms for enterprise applications , databases , private clouds , and edge computing deployments .", "Learn more about hyperconvergence A couple of key developments have made HCI more appealing for more workloads .", "One is the ability to independently scale compute and storage capacity , via a disaggregated model .", "The other is the ability to create a hyperconverged solution using NVMe an open logical device interface specification for accessing non-volatile storage media attached via a PCI Express bus over fabrics .", "In general , there is a greater understanding of the value proposition of HCI , specifically for smaller enterprises that may not need [ or ] want a full-scale data center infrastructure , but want to retain some control over their environments , says Sebastian Lagana , research manager , infrastructure platforms and technologies , at research firm IDC .", "The increasing use of hybrid cloud environments by enterprises also lines up nicely with the software-defined data center story , which HCI is certainly a large part of , Lagana says .", "HCI has become a suitable platform for broader use due to a lot of the underlying improvements in the technology , Lagana says .", "At the same time , many enterprises have gone through an IT refresh cycle and HCI seems like a natural transition .", "Weve spoken with some HCI adopters and 
, in some cases , folks were talking to are upgrading multiple generation-old infrastructure running on old , sometimes now unsupported software , Lagana says .", "At that point , if the old server and/or storage technology theyre using is that far behind whats now available , it becomes a matter of the level of complexity theyre seeking in their new environment .", "HCI has the required horsepower while providing a user-friendly management interface , Lagana says .", "Could you run faster with a highly customized solution ?", "he says .", "Sure , but in many cases its not worth the extra effort when the HCI solution will suffice and provides good long-term scalability .", "Among the key benefits organizations can see from deploying HCI more broadly are greater consolidation and simplification of the IT infrastructure , which allows IT teams to better support business objectives , Lagana says .", "Other possible benefits include faster helpdesk response times , proactive understanding of potential hardware failures , the ability to quickly spin up new servers or test environments , faster disaster recovery and easier backup features .", "There are also the more mechanical benefits , Lagana says .", "Hardware consolidation provides power , cooling and facilities cost savings , which is easy to measure and is an easy sell to less tech-savvy budget holders , he says .", "Also , HCI and the underlying software makes it easier to maximize utilization of existing resources , which reduces longer-term storage and server expenses as well .", "HCI deployment scales as business expands Celtic Manor Collection , a resort hotel and conference center operator , has been using two clusters of Dell EMCs VXrail HCI appliance , beginning in September 2017 .", "Among the initial business drivers for deploying HCI was that Celtic Manor was embarking on a new joint venture to build an international convention center in Wales , says Chris Stanley , IT manager .", "The project required 
the flexibility to scale systems quickly , the ability to easily manage and maintain data center capacity with a small team , the ability to respond quickly to any outages in service , and resiliency to avoid any downtime for large-scale events at the convention center .", "Celtic Manor previously had an environment that included storage-area networks ( SAN ) and VMware ESXi servers , but it was taking a lot of resources to maintain , upgrade , and troubleshoot , Stanley says .", "The business was growingand still israpidly and bursting at the seams with data , he says .", "We needed a complete rethink to prepare the data center for the future and simplify management .", "Initially the company was deploying the clusters as separate data centers for different business entities .", "When we deployed our second cluster we quickly realized we could do more if the two were able to connect over the network together , Stanley says .", "As of today , we now have our core business systems split between the two clusters , with all off these having a recover point copy on the opposite cluster .", "So we now have full cluster failover if required , [ which ] gives us a lot of peace of mind as a business .", "HCI has become the core tech in our business , Stanley says .", "With our planned business expansion of several new hotels in the next two years , we have a template with predictive costs and scalability .", "The company uses HCI for its main enterprise applications , which run on large Oracle and SQL databases .", "These are using less resources than when they were in their previous environment , and we regularly monitor these to see if any servers are over provisioned , Stanley says .", "Celtic Manor is preparing to roll out VDI , with up to 450 endpoints added over the next 12 to 18 months .", "With our business growing , we are looking to potentially use the HCI clusters for cloud and remote deployment for our new hotels , Stanley says .", "VXrail has given us a solid 
flexible platform to grow our business .", "What has enabled an expanded role for HCI are developments in NVMe over fabrics , with CPUs having a smaller workload intensity , and greater amounts of input/output operations per second ( IOPS ) being achieved on a regular basis , Stanley says .", "With demands on data center performance growing to process and store vast amounts of data every second , it is great timing for the hyperconverged market to make its mark , Stanley says .", "Among the key benefits of HCI thus far are less time spent by the IT team on upgrading and maintaining the data center ; improved application performance ; and a 10 % reduction in data center power consumption .", "HCI powers county 's core apps and services Also expanding its use of HCI is the County of San Mateo , Calif. , which began using Nutanixs HCI platform in 2014 .", "We originally looked at the HCI solution to solve performance issues with our VDI deployment on VMwares Horizon platform , says Jon Walton , CIO .", "We had unsuccessfully tried to use EMC , Dell , and NetApp storage on blade servers , but kept running into high latency issues , especially as users logged into their sessions .", "After initial successes with VDI , county officials began to consider using the Nutanix HCI platform for all of its virtual workloads .", "The timing was perfect , as we were starting to virtualize more and more workloads , Walton says .", "In the last two years , the county has moved all its heavier workloads running Microsoft SQL and Oracle to dedicated Nutanix clusters .", "Most recently , it moved its countywide voice-over-IP implementation to two dedicated Nutanix clusters running Avaya Aura on VMware ESXi .", "There have been constant improvements on every level with HCI , Walton says .", "Shortly after we adopted Nutanix , they came out with one-click software upgrades .", "Through their HTML5 interface , we can upgrade every element of our virtual stackdisk firmware , BIOS , Nutanix 
AOS , Nutanix health check and VMware ESXiwith zero downtime and almost zero interaction .", "San Mateo has already converted 99 % of its Oracle and MS SQL applications to the HCI environment .", "It is also leveraging Nutanixs Protection Domain replication service for remote sites to provide high availability within county data centers , Walton says .", "With HCI , instead of spending all our time reacting to problems and resource constraints , we now have the time to research smart technology choices for the county , Walton says .", "Additionally , we no longer must rely on a small group of SMEs [ subject matter experts ] to provide expertise around storage and servers , as Nutanix takes care of it for us .", "County residents who rely on a variety of services have also seen benefits .", "They dont know or care what we run on , they just know it is fast and has had almost zero downtime in five-plus years , Walton says .", "Hyperconvergence market trends Demand for HCI and for data center convergence in general is on the rise .", "A recent report by research firm IDC shows that worldwide converged systems market revenue increased 10 % year over year to $ 3.5 billion during the second quarter of 2018 .", "HCI products helped to drive second quarter market expansion , the study said , thanks in part to their ability to reduce infrastructure complexity , promote consolidation , and allow IT teams to support an organization 's business objectives .", "Revenue from hyperconverged systems sales grew 78 % year over year during the second quarter , generating $ 1.5 billion worth of sales .", "This amounted to 41 % of the total converged systems market , the report said .", "IDC provides two ways to rank technology suppliers within the hyperconverged systems market , in terms of market share .", "One is by the brand of the hyperconverged platform and the other is by the owner of the software providing the core hyperconverged capabilities .", "For brand , those with the 
highest share are Dell , Nutanix , Cisco , and HPE .", "In terms of HCI software , the leaders are Nutanix , VMware , Dell , Cisco , and HPE .", "As for future developments in the hyperconvergence market , one of the growing trends is NVMe-based HCI , Lagana says .", "Were seeing flash as a major adoption driver , not just in HCI but in broader converged infrastructure and storage markets , and NVMe is the next step in that evolution , he says .", "Join the newsletter !", "Error : Please check your email address ."]}
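
For reference, a file in this format can be produced and checked with a few lines of Python. This is only a minimal sketch, assuming Python 3; the helper names (`make_example`, `is_valid_example`, `write_example`, `read_article`) are hypothetical and are not part of pointer_summarizer.

```python
import io
import json
import os
import tempfile


def make_example(sentences, example_id=""):
    # Build a record with the three fields described above; the abstract is
    # left empty because no reference summary exists at decode time.
    return {"id": example_id, "abstract": [], "article": list(sentences)}


def is_valid_example(content):
    # True if `content` is a dict whose "article" field is a non-empty
    # list of strings.
    article = content.get("article") if isinstance(content, dict) else None
    return (
        isinstance(article, list)
        and len(article) > 0
        and all(isinstance(s, str) for s in article)
    )


def write_example(path, sentences):
    # Serialize one record as UTF-8 JSON, one file per text.
    with io.open(path, "w", encoding="utf-8") as fp:
        fp.write(json.dumps(make_example(sentences), ensure_ascii=False))


def read_article(path):
    # Load one file and join its sentences into a single lowercase string.
    with io.open(path, "r", encoding="utf-8") as fp:
        content = json.load(fp)
    if not is_valid_example(content):
        raise ValueError("not a valid example file: %s" % path)
    return " ".join(content["article"]).lower()
```

Writing a small file with `write_example` and reading it back with `read_article` is a quick way to confirm that a directory of inputs parses before it is handed to the decoder.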

To adjust the code so that my new format can be used, I did the following.

  1. I modified example_generator in data_util/data.py as follows so that all my JSON files in the data_path directory can be read.

    
    import glob
    import io
    import json
    import os
    import random


    def example_generator(data_path, single_pass):
      # Yields (abstract, article) tuples read from the JSON files in data_path.
      while True:
        print("Starting to generate examples")
        files = find_all_files(data_path)
        assert files, ('Error: No file at %s' % data_path)

        if single_pass:
          files = sorted(files)
        else:
          random.shuffle(files)

        for f in files:
          print("Example generated")
          with io.open(f, 'r', encoding='utf-8') as fp:
            content = json.load(fp)
          if 'article' not in content:
            continue
          abstract = ''
          article = ' '.join(content['article'])
          yield abstract.lower(), article.lower()

        if single_pass:
          print("example_generator completed reading all datafiles. No more data.")
          break


    def find_all_files(data_path):
      print("Finding all files")
      # Build full paths with os.path.join instead of os.chdir + string
      # concatenation, so the working directory is left unchanged and
      # data_path does not need a trailing slash.
      files = glob.glob(os.path.join(data_path, '*.json'))
      print("There are in total %d files" % len(files))
      return files

  2. I modified `text_generator` in data_util/batcher.py so that it uses the tuple my `example_generator` yields instead of the TensorFlow Example object.
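```py
    # Method of the batcher class; `tf` is the TensorFlow module this file
    # already uses for logging.
    def text_generator(self, example_generator):
      while True:
        # Each example is an (abstract, article) tuple.
        example = next(example_generator)
        try:
          article_text = example[1]
          abstract_text = example[0]
        except (IndexError, TypeError):
          # Indexing a malformed example raises IndexError or TypeError,
          # not ValueError.
          tf.logging.error('Failed to get article or abstract from example')
          continue

        if len(article_text) == 0:
          tf.logging.warning('Found an example with empty article text. Skipping it.')
          continue
        else:
          yield article_text, abstract_text
```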

After the above two steps, I modified data_util/config.py so that the data paths are set correctly.

I used the vocab file located in the finished_files directory that came with the CNN/Daily Mail dataset.

Lastly, I ran start_decode.py with a model trained for 500k iterations and expected it to work.

However, I got this error with pointer generator turned off.

Traceback (most recent call last):
  File "/root/miniconda3/envs/py27/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/root/miniconda3/envs/py27/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/gluster/projects/pointer_summarizer_decode/training_ptr_gen/decode.py", line 210, in <module>
    beam_Search_processor.decode()
  File "/gluster/projects/pointer_summarizer_decode/training_ptr_gen/decode.py", line 84, in decode
    (batch.art_oovs[0] if config.pointer_gen else None))
  File "data_util/data.py", line 178, in outputids2words
AssertionError: Error: model produced a word ID that isn't in the vocabulary. This should not happen in baseline (no pointer-generator) mode

I printed the variable output_ids in the function decode in training_ptr_gen/decode.py and got the following.

[9223372034707292159, 0, 9223372034707292159, 2, 9223372034707292159, 2, 9223372034707292159, 2, 9223372034707292159, 2, 9223372034707292159, 2, 9223372034707292159, 2, 9223372034707292159, 2, 9223372034707292159, 2, 9223372034707292159, 2, 9223372034707292159, 0, 9223372034707292159, 9223372034707292159, 2, 9223372034707292159, 2, 0, 9223372034707292159, 9223372034707292159, 9223372034707292159, 0, 9223372034707292159, 0, 9223372034707292159, 3]

As you can see, the value 9223372034707292159 appears over and over, and I have no clue where it comes from.
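
One way to narrow this down is to list exactly which positions hold IDs outside the valid range. The following is a minimal sketch, not code from the repository: `find_invalid_ids` is a hypothetical helper, and the vocabulary size of 50000 used in the example is an assumption that should be replaced with the size of the vocabulary actually loaded.

```python
def find_invalid_ids(output_ids, vocab_size, num_article_oovs=0):
    # Valid IDs are 0 .. vocab_size - 1. In pointer-generator mode, each
    # in-article OOV word gets one extra ID after that range, so the limit
    # grows by num_article_oovs; in baseline mode it stays at vocab_size.
    limit = vocab_size + num_article_oovs
    return [(pos, idx) for pos, idx in enumerate(output_ids)
            if not 0 <= idx < limit]


# First four IDs from the printed output above; 50000 is an assumed size.
sample_ids = [9223372034707292159, 0, 9223372034707292159, 2]
invalid = find_invalid_ids(sample_ids, vocab_size=50000)
print(invalid)
# → [(0, 9223372034707292159), (2, 9223372034707292159)]
```

If every invalid entry carries the same huge value, as here, the problem is more likely to sit upstream of outputids2words (in how the IDs are produced) than in the vocabulary lookup itself.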

Is it possible at all to use a model trained on the CNN/Daily Mail dataset to summarize third-party texts stored in files with a different format?

If so, did I do something wrong when preparing the examples?

Thank you so much for your help!