fasterius / VarClust

A Python package for clustering of single nucleotide variants from high-throughput sequencing data.

Heatmap: location for distance matrix, path for output figure #6

Open Ascalon98 opened 5 months ago

Ascalon98 commented 5 months ago

Hi! I tried to use the heatmap function, but it did not work. I am not sure where the distance matrix is supposed to be, but it was in my miniconda3 directory, which I think is weird. Could you also tell me what the path for the output figure should be? It seems it is not just a simple jpg file.

(base) aimre@sisko:~/VarClust_data$ varclust_heatmap /home5/aimre/miniconda3/bin/varclust_distance_matrix /home5/aimre/Varclust_figure/heatmap1.jpg
Traceback (most recent call last):
  File "/home5/aimre/miniconda3/bin/varclust_heatmap", line 127, in <module>
    cluster.cluster_hierarchical(distances=distances,
  File "/home5/aimre/miniconda3/lib/python3.11/site-packages/varclust/cluster.py", line 148, in cluster_hierarchical
    colours['label'] = colours['index'].str.split(': ', 1).str[0]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home5/aimre/miniconda3/lib/python3.11/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: StringMethods.split() takes from 1 to 2 positional arguments but 3 were given
(base) aimre@sisko:~/VarClust_data$ varclust_heatmap /home5/aimre/miniconda3/bin/varclust_distance_matrix /home5/aimre/miniconda3/bin/varclust_heatmap
Traceback (most recent call last):
  File "/home5/aimre/miniconda3/bin/varclust_heatmap", line 127, in <module>
    cluster.cluster_hierarchical(distances=distances,
  File "/home5/aimre/miniconda3/lib/python3.11/site-packages/varclust/cluster.py", line 148, in cluster_hierarchical
    colours['label'] = colours['index'].str.split(': ', 1).str[0]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home5/aimre/miniconda3/lib/python3.11/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: StringMethods.split() takes from 1 to 2 positional arguments but 3 were given
(base) aimre@sisko:~/VarClust_data$ varclust_heatmap /home5/aimre/miniconda3/bin/varclust_distance_matrix /home5/aimre/Varclust_figure/heatmap1.jpg
Traceback (most recent call last):
  File "/home5/aimre/miniconda3/bin/varclust_heatmap", line 127, in <module>
    cluster.cluster_hierarchical(distances=distances,
  File "/home5/aimre/miniconda3/lib/python3.11/site-packages/varclust/cluster.py", line 148, in cluster_hierarchical
    colours['label'] = colours['index'].str.split(': ', 1).str[0]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home5/aimre/miniconda3/lib/python3.11/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: StringMethods.split() takes from 1 to 2 positional arguments but 3 were given

fasterius commented 5 months ago

Could you please write down all of the commands that you have run, from the start, as well as where your data is stored? I think that if you've gotten results inside your Conda directory you've done something odd.

Ascalon98 commented 5 months ago

Hi! Yes indeed! I overwrote the distance matrix script somehow. So I downloaded the source code and copied the distance matrix script into the miniconda folder again, and then it worked! Now I can generate the distance matrix, and it is in one of my folders. I used this command for it:

(base) aimre@sisko:~/VarClust-0.2.3/bin$ varclust_distance_matrix /home5/aimre/Varclust_profiles/ /home5/aimre/Varclust_profiles/output_distance_matrix

However, for the heatmap I am still getting a similar error message. I am sorry, it is probably something very basic that I am missing, so I really appreciate your help!

(base) aimre@sisko:~/Varclust_profiles$ varclust_heatmap /home5/aimre/Varclust_profiles/output_distance_matrix /home5/aimre/Varclust_profiles/output_heatmap_figure
Traceback (most recent call last):
  File "/home5/aimre/miniconda3/bin/varclust_heatmap", line 123, in <module>
    cluster.cluster_hierarchical(distances=distances,
  File "/home5/aimre/miniconda3/lib/python3.11/site-packages/varclust/cluster.py", line 148, in cluster_hierarchical
    colours['label'] = colours['index'].str.split(': ', 1).str[0]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home5/aimre/miniconda3/lib/python3.11/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: StringMethods.split() takes from 1 to 2 positional arguments but 3 were given

Ascalon98 commented 5 months ago

Here is the information you asked for earlier: I have followed the instructions exactly so far. My single-sample vcf.gz files are in /home5/aimre/VarClust_data/ and the generated profile files are in /home5/aimre/Varclust_profiles/.

Command for creating profiles: varclust_create_profiles /home5/aimre/VarClust_data/ /home5/aimre/Varclust_profiles/

I think the rest of the steps I have done are in my previous comment. Let me know if you need more information!

fasterius commented 4 months ago

Okay, strange. Do you have a way of sharing the vcf file you have, or at least a portion of it, so that I can test it on my end?

Ascalon98 commented 4 months ago

Yes! I attached a few of my vcf files. They are all single sample vcf files. I appreciate your help! The file names contain '.sam' because for some reason the GATK pipeline thought that it was part of the sample name and left it like this in the heading of the vcf files.

W1_1_S1L003.sam.vcf.gz W1_2_S8L003.sam.vcf.gz W1_3_S15L003.sam.vcf.gz W1_4_S22L003.sam.vcf.gz

Ascalon98 commented 4 months ago

Update: In the meantime I fixed the issue. I found the solution here: https://stackoverflow.com/questions/76812405/typeerror-stringmethods-rsplit-takes-from-1-to-2-positional-arguments-but-3-w

So I just corrected str.split(': ', 1) to str.split(': ', n=1) and then it worked.
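
For anyone hitting the same TypeError, here is a minimal sketch of that change outside of VarClust. The label strings are made up for illustration (I am only assuming a "name: group" shape), and the snippet just reproduces the one call from cluster.py with the split limit passed as a keyword, which pandas 2.x requires.

```python
import pandas as pd

# Made-up labels in a "name: group" shape; the real VarClust index values may differ.
colours = pd.DataFrame({"index": ["W1_1: control", "W1_2: treated: rep2"]})

# pandas 2.x accepts only the pattern positionally, so the limit goes in as n=1.
parts = colours["index"].str.split(": ", n=1)
colours["label"] = parts.str[0]   # text before the first ": "
colours["rest"] = parts.str[1]    # everything after the first ": "

print(colours)
```

The keyword form also runs on older pandas versions, so it should be safe regardless of which pandas is installed.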

However, I noticed that the generated figure is not informative: according to VarClust there is no detectable difference between my samples. Also, in the distance matrix every comparison received the same value (0.8333), which does not seem realistic to me. So I opened my profile files, and it turned out they were empty apart from a header line. I will try to figure out the reason for this, but if you have any idea, I would genuinely appreciate it!

Here is the figure that VarClust generated. output_heatmap_figure

fasterius commented 3 months ago

Sorry for the late response, I've been on vacation and away from the computer.

Okay, it sounds like your VCFs are malformed somehow. The value 0.8333 is what you get when comparing empty profiles with the similarity score and default parameters: similarity score = 1 - (matches + a) / (total + a + b), where a = 1 and b = 5 by default. With empty profiles, matches = 0 and total = 0, so this yields a score of 1 - 1/6 = 0.8333.
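
To make the arithmetic concrete, here is a small sketch of that formula as quoted in this comment. The function name `score_from_thread` is made up for illustration and this is not VarClust's actual code; only the formula and the defaults a = 1, b = 5 come from the text above.

```python
def score_from_thread(matches: int, total: int, a: int = 1, b: int = 5) -> float:
    """Score as quoted in this thread: 1 - (matches + a) / (total + a + b)."""
    return 1 - (matches + a) / (total + a + b)


# Empty profiles: nothing matched and nothing compared.
empty = score_from_thread(matches=0, total=0)
print(round(empty, 4))  # 0.8333

# For contrast, ten compared variants that all match.
print(score_from_thread(matches=10, total=10))  # 0.3125
```

So a matrix where every cell is 0.8333 is a strong hint that no variants were compared at all.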

Could you try passing the --method position_only argument? Looking at your VCFs it seems you do not have annotations from SnpEff, so the default full profiles won't work.
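
A quick way to check whether a VCF carries SnpEff annotations is to look for the INFO declaration SnpEff writes into the header (ANN, or EFF in older SnpEff versions). The sketch below does only that; the function names are made up, and whether VarClust needs exactly this field is an assumption on my part.

```python
import gzip
from typing import Iterable


def declares_snpeff(lines: Iterable[str]) -> bool:
    """Return True if the VCF meta-lines declare an ANN (or legacy EFF) INFO field."""
    for line in lines:
        if not line.startswith("##"):
            break  # meta-lines are over once the #CHROM line or a record appears
        if line.startswith(("##INFO=<ID=ANN,", "##INFO=<ID=EFF,")):
            return True
    return False


def vcf_gz_has_snpeff(path: str) -> bool:
    """Same check for a gzip-compressed VCF on disk (hypothetical helper)."""
    with gzip.open(path, "rt") as handle:
        return declares_snpeff(handle)
```

Calling `vcf_gz_has_snpeff` on one of the attached files would tell you whether the default full profiles have anything to work with.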

I can also see that your chromosomes are named with Roman numerals instead of normal numbers. I'm not sure this is supported by the underlying PyVCF package.

Ascalon98 commented 3 months ago

Hi! Thank you for your answer! I added snpEff annotations but, when I wanted to build the profiles, I received error messages. Without snpEff annotations the command makes empty profiles. Can you send over one of your vcf.gz files that works for you for comparison?

I tried the --method position_only argument like you see below, but it did not work either. Can you elaborate on the usage of this argument?

varclust_create_profiles /home5/aimre/VarClust_data/ /home5/aimre/VarClust_data/snpEff_annotated/varclust_probe/probe_profiles/ --method position_only

fasterius commented 2 months ago

Could you try this file? https://github.com/fasterius/seqCAT/blob/master/inst/extdata/sample1.vcf.gz

The --method position_only argument makes comparisons happen based on SNV positions only, rather than based on position and annotations. Since you don't have annotations from snpEff you should be using positions only.
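
As a rough illustration of the difference (this is not VarClust's code, and the records, the `variant_keys` helper and the rule that unannotated records are skipped are all assumptions for the sake of the example), comparing on positions only versus positions plus annotations could look like this:

```python
# Hypothetical variant records: (chromosome, position, annotation or None).
sample_a = [("I", 1042, None), ("II", 887, None), ("IV", 15, None)]
sample_b = [("I", 1042, None), ("II", 900, None), ("IV", 15, None)]


def variant_keys(variants, position_only):
    """Build the set of keys on which two samples are compared."""
    if position_only:
        return {(chrom, pos) for chrom, pos, _ in variants}
    # Assumed behaviour: a record without an annotation cannot form a full key.
    return {(chrom, pos, ann) for chrom, pos, ann in variants if ann is not None}


shared_positions = variant_keys(sample_a, True) & variant_keys(sample_b, True)
shared_full = variant_keys(sample_a, False) & variant_keys(sample_b, False)

print(len(shared_positions), len(shared_full))  # 2 0
```

With unannotated records the full comparison has nothing to work with, while the position-only comparison still finds the two shared sites.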