Azure-Samples / Synapse

Samples for Azure Synapse Analytics
MIT License

The information on environment setup is incorrect #55

Closed ymasaoka closed 3 years ago

ymasaoka commented 3 years ago

Hello,

The content of the "Let's get the environment ready" section described in the following document appears to be incorrect.

Creating requirements.txt as instructed and applying it to the Spark pool does not install pymongo. pymongo also does not appear in the library listing output shown in the document.
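For reference, a minimal check cell along these lines (a sketch, not necessarily the document's exact cell) can confirm whether pymongo is actually available in the pool session:

# Sketch: list the packages visible to the Spark pool session and check for pymongo.
import pkg_resources

installed = sorted(pkg.project_name.lower() for pkg in pkg_resources.working_set)
print("pymongo installed:", "pymongo" in installed)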

I have asked a similar question on Microsoft Q&A and am currently investigating the cause.

Please confirm and investigate.

Thanks,

ymasaoka commented 3 years ago

Hello Reviewers,

I modified the README and .ipynb files and added them to the pull request I have already sent. Please review and merge.

Thanks,

Rodrigossz commented 3 years ago

Hello @ymasaoka

We published a new version. Do you want to check it out?

ymasaoka commented 3 years ago

Hello @Rodrigossz ,

Of course I will! Give me a few days and I will comment here again with the results of my check. Thank you for responding!

ymasaoka commented 3 years ago

Hello @Rodrigossz ,

Sorry for the wait. I checked the contents and have a few concerns; could you take a look?

Insert another dataset, but this time using the MongoSpark connector.

Is the MongoSpark connector actually used here? Both data inserts appear to use the db object from MongoClient.
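For comparison, here is a rough sketch of the two insert paths; the variable names, database name, and connection string are assumptions, not the sample's actual code:

from pymongo import MongoClient

# `connection_string` is assumed to be the Cosmos DB for MongoDB connection
# string defined earlier in the notebook; `spark` is the notebook session.
documents = [{"item": "apple", "price": 1.0}, {"item": "pear", "price": 2.0}]

# Path 1: a plain MongoClient insert, which is what both cells appear to do
client = MongoClient(connection_string)
client["database_name"]["HTAP"].insert_many(documents)

# Path 2: what an insert through the MongoSpark connector would look like
df = spark.createDataFrame(documents)
(df.write
   .format("com.mongodb.spark.sql.DefaultSource")
   .mode("append")
   .option("uri", connection_string)
   .option("database", "database_name")
   .option("collection", "HTAP")
   .save())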

Create a collection named HTAP with a partition key called item.

The Cosmos DB Data Explorer uses the term "Shard key," and "shard key" is also the MongoDB term, while the document says "partition key." I was concerned because the terminology is not consistent in this part.
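For what it's worth, on Azure Cosmos DB's API for MongoDB a sharded collection is created with the CreateCollection custom action, and the shard key is what Data Explorer then surfaces as the partition key. A sketch, with `connection_string` and the database name assumed:

from pymongo import MongoClient

client = MongoClient(connection_string)  # Cosmos DB for MongoDB connection string, assumed defined
db = client["database_name"]

# Cosmos DB custom action: create the HTAP collection sharded on "item"
db.command({"customAction": "CreateCollection",
            "collection": "HTAP",
            "shardKey": "item"})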

Thanks,

Rodrigossz commented 3 years ago

Good catches. Will fix it. Thanks.

ymasaoka commented 3 years ago

Also, some of the PySpark syntax is incorrect; could you please correct it?

cell 12

# df.groupBy(df.item.string).sum().show() # incorrect
df.groupBy(df['item']).sum().show() # correct

cell 13 & cell 18

# df.printSchema # We can confirm it, but the syntax is not accurate.
df.printSchema() # correct

I was trying out the content just now, but it seems I hit a bug in Azure Synapse Link for Azure Cosmos DB. I have not yet been able to confirm the part where the schema information for the timestamp in the new content is updated. However, this part is the same as in the previous content, and since I was able to verify it working last time, I think there is no problem. (I have already given this feedback to the Azure Synapse Link team.)

(Screenshot attached: 2020-10-24 3:41:14)

Rodrigossz commented 3 years ago

I fixed the typos and changed the schema command; it was not actually wrong, but your suggestion certainly returns a more elegant view. About the aggregations: you wrote that the existing command is incorrect and suggested another one, but that is not the case; the existing syntax is correct. Based on that confusion, I added an explanation about it.
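(For context, a hedged illustration of why the existing spelling can be correct, assuming the analytical store exposes the full fidelity schema, where each property becomes a struct of typed sub-fields such as item.string:)

# Assumed full fidelity schema: "item" is a struct with typed leaves,
# not a plain string column.
df.printSchema()
# root
#  |-- item: struct (nullable = true)
#  |    |-- string: string (nullable = true)

# Groups on the typed sub-field, which is what the notebook cell does
df.groupBy(df.item.string).sum().show()

# This spelling would only be equivalent if "item" were a plain top-level column
df.groupBy(df["item"]).sum().show()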

The PR is here: https://github.com/Azure-Samples/Synapse/pull/58